<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Preprocessing input data for machine learning by FCA</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Jan</forename><surname>Outrata</surname></persName>
							<email>jan.outrata@upol.cz</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">Palacky University</orgName>
								<address>
									<addrLine>17. listopadu 12</addrLine>
									<postCode>771 46</postCode>
									<settlement>Olomouc</settlement>
									<country>Czech Republic</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Preprocessing input data for machine learning by FCA</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">96005D67469A92C2FFAB43FD00A233AF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:11+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The paper presents a utilization of formal concept analysis in input data preprocessing for machine learning. Two preprocessing methods are presented. The first consists in extending the set of attributes describing objects in the input data table by new attributes, and the second consists in replacing the attributes by new attributes. In both methods the new attributes are defined by certain formal concepts computed from the input data table. The selected formal concepts are so-called factor concepts obtained by boolean factor analysis, recently described by FCA. The ML method used to demonstrate the ideas is decision tree induction. An experimental evaluation and comparison of the performance of decision trees induced from original and from preprocessed input data is performed with the standard decision tree induction algorithms ID3 and C4.5 on several benchmark datasets.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Formal concept analysis (FCA) is often proposed as a method for data preprocessing before the data is processed by another data mining or machine learning method <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b7">8]</ref>. The results produced by these methods indeed depend on the structure of the input data. In the case of relational data described by objects and their attributes (object-attribute data), the structure of the data is defined by the attributes and, more particularly, by dependencies between the attributes. Data preprocessing in general then usually consists in a transformation of the set of attributes into another set of attributes, in order to enable the particular data mining or machine learning method to achieve better results <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>.</p><p>The paper presents a data preprocessing method utilizing formal concept analysis in such a way that certain formal concepts are used to create new attributes describing the original objects. The selected formal concepts are so-called factor concepts obtained by boolean factor analysis, recently described by means of FCA in <ref type="bibr" target="#b0">[1]</ref>. First, attributes defined by the concepts are added to the original set of attributes, extending the dimensionality of the data. The new attributes are supposed to aid the data mining or machine learning method. Second, the original attributes are replaced by the new attributes, which usually means a reduction of the dimensionality of the data (⋆ Supported by grant no. P202/10/P360 of the Czech Science Foundation), since the number of factor concepts is usually smaller than the number of original attributes. 
Here a main question arises: whether the reduced number of new attributes can describe the input objects better, so that the subsequent data mining or machine learning method produces better results.</p><p>There have been several attempts to transform the attribute space in order to improve the results of data mining and machine learning methods. From the variety of these methods we focus on decision tree induction. Most relevant to our paper are methods known as constructive induction or feature construction <ref type="bibr" target="#b6">[7]</ref>, where new compound attributes are constructed from the original attributes as conjunctions and/or disjunctions of the attributes <ref type="bibr" target="#b10">[11]</ref>, by arithmetic operations <ref type="bibr" target="#b11">[12]</ref>, or in m-of-n form <ref type="bibr" target="#b8">[9]</ref>. Oblique decision trees <ref type="bibr" target="#b9">[10]</ref> are also connected to our approach in the sense that multiple attributes are used in the splitting condition (see section 3.1) instead of a single attribute at a time. Typically, linear combinations of attributes are sought, e.g. <ref type="bibr" target="#b1">[2]</ref>. Learning such a condition is, however, computationally challenging.</p><p>Interestingly, we have not found any paper solely on this subject utilizing formal concept analysis. There have been several FCA-based approaches to the construction of a whole learning model, commonly called lattice-based or concept-based machine learning approaches, e.g. <ref type="bibr" target="#b5">[6]</ref>, see <ref type="bibr" target="#b2">[3]</ref> for a survey and comparison, but the usage of FCA to transform the attributes and create new attributes to aid another machine learning method is discussed only marginally, or not at all. The present paper is thus a step toward filling this gap.</p><p>The remainder of the paper is organized as follows. 
The next section contains preliminaries on FCA and an introduction to boolean factor analysis, including the necessary transformations between the attribute and factor spaces. The main part of the paper is section 3, demonstrating the above-sketched ideas on a selected machine learning method: decision tree induction. An experimental evaluation on selected data mining and machine learning benchmark datasets is provided in section 4. Finally, section 5 draws conclusions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Formal Concept Analysis</head><p>In this section we summarize the basic notions of FCA. For further information we refer to <ref type="bibr" target="#b3">[4]</ref>. An object-attribute data table is identified with a triplet ⟨X, Y, I⟩ where X is a non-empty set of objects, Y is a non-empty set of attributes, and I ⊆ X × Y is an object-attribute relation. Objects and attributes correspond to table rows and columns, respectively, and ⟨x, y⟩ ∈ I indicates that object x has attribute y (the table entry corresponding to row x and column y contains × or 1; otherwise it contains a blank symbol or 0). In terms of FCA, ⟨X, Y, I⟩ is called a formal context. For every A ⊆ X and B ⊆ Y denote by A↑ a subset of Y and by B↓ a subset of X defined as</p><formula xml:id="formula_0">A↑ = {y ∈ Y | for each x ∈ A : ⟨x, y⟩ ∈ I}, B↓ = {x ∈ X | for each y ∈ B : ⟨x, y⟩ ∈ I}.</formula><p>That is, A↑ is the set of all attributes from Y shared by all objects from A (and similarly for B↓). A formal concept in ⟨X, Y, I⟩ is a pair ⟨A, B⟩ of A ⊆ X and B ⊆ Y satisfying A↑ = B and B↓ = A. That is, a formal concept consists of a set A (the so-called extent) of objects which are covered by the concept and a set B (the so-called intent) of attributes which are covered by the concept, such that A is the set of all objects sharing all attributes from B and, conversely, B is the collection of all attributes from Y shared by all objects from A. Formal concepts represent clusters hidden in object-attribute data.</p><p>The set B(X, Y, I) = {⟨A, B⟩ | A↑ = B, B↓ = A} of all formal concepts in ⟨X, Y, I⟩ can be equipped with a partial order ≤. The partial order models a subconcept-superconcept hierarchy, e.g. dog ≤ mammal, and is defined by</p><formula xml:id="formula_1">⟨A_1, B_1⟩ ≤ ⟨A_2, B_2⟩ iff A_1 ⊆ A_2 (iff B_2 ⊆ B_1).</formula><p>B(X, Y, I) equipped with ≤ happens to be a complete lattice, called the concept lattice of ⟨X, Y, I⟩. 
The basic structure of concept lattices is described by the so-called basic theorem of concept lattices, see <ref type="bibr" target="#b3">[4]</ref>.</p></div>
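The derivation operators ↑ and ↓ and the definition of a formal concept can be expressed directly in a few lines of Python. This is a minimal sketch, not part of the paper: the names `up`, `down`, and `is_formal_concept` are illustrative, and the context is assumed to be given as a set of object-attribute pairs.

```python
# Illustrative sketch of the derivation operators and the formal-concept check.

def up(A, I, Y):
    """A↑: attributes from Y shared by all objects in A."""
    return {y for y in Y if all((x, y) in I for x in A)}

def down(B, I, X):
    """B↓: objects from X having all attributes in B."""
    return {x for x in X if all((x, y) in I for y in B)}

def is_formal_concept(A, B, I, X, Y):
    """A pair (A, B) is a formal concept iff A↑ = B and B↓ = A."""
    return up(A, I, Y) == B and down(B, I, X) == A

# Toy context: objects 1..3, attributes 'a','b','c'
X = {1, 2, 3}
Y = {'a', 'b', 'c'}
I = {(1, 'a'), (1, 'b'), (2, 'b'), (3, 'b'), (3, 'c')}

print(up({1, 2}, I, Y))                               # {'b'}
print(down({'b'}, I, X))                              # {1, 2, 3}
print(is_formal_concept({1, 2, 3}, {'b'}, I, X, Y))   # True
```

Note that ({1, 2}, {'b'}) is not a formal concept in this toy context: its intent {'b'} closes back to the larger extent {1, 2, 3}.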
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Boolean Factor Analysis</head><p>Boolean factor analysis is a matrix decomposition method which provides a representation of an object-attribute data matrix by a product of two different matrices, one describing objects by new attributes, or factors, and the other describing factors by the original attributes <ref type="bibr" target="#b4">[5]</ref>. Stated as a problem, the aim is to decompose an n × m binary matrix I into a boolean product A • B of an n × k binary matrix A and a k × m binary matrix B with k as small as possible. Thus, instead of m original attributes, one aims to find k new attributes, called factors.</p><p>Recall that a binary (or boolean) matrix is a matrix whose entries are 0 or 1. A boolean matrix product A • B of binary matrices A and B is defined by</p><formula xml:id="formula_2">(A • B)_{ij} = ⋁_{l=1}^{k} A_{il} • B_{lj},</formula><p>where ⋁ denotes maximum and • is the usual product. The interpretation of matrices A and B is as follows: A_il = 1 means that factor l applies to object i and B_lj = 1 means that attribute j is one of the manifestations of factor l. Then A • B says: "object i has attribute j if and only if there is a factor l such that l applies to i and j is one of the manifestations of l". As an example,</p><formula xml:id="formula_3">
⎛ 1 1 0 0 0 ⎞   ⎛ 1 0 0 1 ⎞   ⎛ 1 1 0 0 0 ⎞
⎜ 1 1 0 0 1 ⎟ = ⎜ 1 0 1 0 ⎟ • ⎜ 0 0 1 1 0 ⎟
⎜ 1 1 1 1 0 ⎟   ⎜ 1 1 0 0 ⎟   ⎜ 1 0 0 0 1 ⎟
⎝ 1 0 0 0 1 ⎠   ⎝ 0 0 1 0 ⎠   ⎝ 0 1 0 0 0 ⎠
</formula><p>The (solution to the) problem of decomposing binary matrices was recently described by means of formal concept analysis <ref type="bibr" target="#b0">[1]</ref>. The description rests on the observation that the matrices A and B can be constructed from a set F of formal concepts of I. In particular, if B(X, Y, I) is the concept lattice associated to I, with X = {1, . . . , n} and Y = {1, . . . , m}, and</p><formula xml:id="formula_4">F = {⟨A_1, B_1⟩, . . . , ⟨A_k, B_k⟩} ⊆ B(X, Y, I),</formula><p>then for the n × k and k × m matrices A_F and B_F, defined in such a way that the l-th column (A_F)_l of A_F is the characteristic vector of A_l and the l-th row (B_F)_l of B_F is the characteristic vector of B_l, the following universality theorem holds:</p><formula xml:id="formula_5">Theorem 1. For every I there is F ⊆ B(X, Y, I) such that I = A_F • B_F.</formula><p>Moreover, decompositions using formal concepts as factors are optimal in that they yield the least number of factors possible:</p><formula xml:id="formula_6">I = A_F • B_F.</formula><p>The formal concepts in F in the above theorems are called factor concepts. Each factor concept determines a factor. For the constructive proof of the last theorem, examples and further results, we refer to <ref type="bibr" target="#b0">[1]</ref>.</p></div>
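The boolean matrix product defined above can be checked directly on the example decomposition from this section (a 4 × 5 matrix I decomposed into a 4 × 4 matrix A and a 4 × 5 matrix B). This is an illustrative sketch; `bool_product` is our name for the operation.

```python
# Boolean matrix product: (A • B)_ij = max over l of A_il · B_lj.

def bool_product(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[int(any(A[i][l] and B[l][j] for l in range(k)))
             for j in range(m)] for i in range(n)]

# The example decomposition from section 2.2:
I = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 1, 1, 0],
     [1, 0, 0, 0, 1]]

A = [[1, 0, 0, 1],      # column l of A: extent of factor concept l
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 0]]

B = [[1, 1, 0, 0, 0],   # row l of B: intent of factor concept l
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1],
     [0, 1, 0, 0, 0]]

print(bool_product(A, B) == I)  # True
```

The five original attributes are thus exactly reconstructed from four factors.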
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Transformations between attribute and factor spaces</head><p>For every object i we can consider its representations in the m-dimensional Boolean space {0, 1}^m of original attributes and in the k-dimensional Boolean space {0, 1}^k of factors. In the space of attributes, the vector representing object i is the i-th row of the input data matrix I, and in the space of factors, the vector representing i is the i-th row of the matrix A.</p><p>Natural transformations between the space of attributes and the space of factors are described by the mappings g : {0, 1}^m → {0, 1}^k and h : {0, 1}^k → {0, 1}^m, defined for P ∈ {0, 1}^m and Q ∈ {0, 1}^k by</p><formula xml:id="formula_7">(g(P))_l = ⋀_{j=1}^{m} (B_{lj} → P_j),<label>(1)</label></formula><formula xml:id="formula_8">(h(Q))_j = ⋁_{l=1}^{k} (Q_l • B_{lj}),<label>(2)</label></formula><p>for 1 ≤ l ≤ k and 1 ≤ j ≤ m. Here, → denotes the truth function of classical implication (1 → 0 = 0, otherwise 1), • denotes the usual product, and ⋀ and ⋁ denote minimum and maximum, respectively. (1) says that the l-th component of g(P) ∈ {0, 1}^k is 1 if and only if P_j = 1 for every position j for which B_lj = 1, i.e. the l-th row of B is included in P. (<ref type="formula" target="#formula_8">2</ref>) says that the j-th component of h(Q) ∈ {0, 1}^m is 1 if and only if there is a factor l such that Q_l = 1 and B_lj = 1, i.e. attribute j is a manifestation of at least one factor from Q.</p><p>For results showing properties and describing the geometry behind the mappings g and h, see <ref type="bibr" target="#b0">[1]</ref>.</p></div>
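The mappings g and h from (1) and (2) can be sketched as follows. This is an illustrative implementation, assuming B is given as a list of k binary rows of length m; the matrix values below are taken from the example decomposition in section 2.2.

```python
# Transformations between attribute space and factor space, per (1) and (2).

def g(P, B):
    # (g(P))_l = 1 iff the l-th row of B is included in P
    return [int(all(P[j] for j in range(len(P)) if B[l][j]))
            for l in range(len(B))]

def h(Q, B):
    # (h(Q))_j = 1 iff attribute j manifests some factor l with Q_l = 1
    return [int(any(Q[l] and B[l][j] for l in range(len(B))))
            for j in range(len(B[0]))]

# B from the example decomposition in section 2.2
B = [[1, 1, 0, 0, 0],
     [0, 0, 1, 1, 0],
     [1, 0, 0, 0, 1],
     [0, 1, 0, 0, 0]]

P = [1, 1, 0, 0, 0]      # first row of the example matrix I
print(g(P, B))           # [1, 0, 0, 1] — the first row of A
print(h(g(P, B), B))     # [1, 1, 0, 0, 0] — recovers P
```

For rows of an exactly decomposed I, h(g(P)) recovers P, matching the decomposition I = A • B.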
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Boolean Factor Analysis and Decision Trees</head><p>The machine learning method which we use in this paper to demonstrate the ideas presented in section 1 is decision tree induction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Decision Trees</head><p>Decision trees represent one of the most commonly used methods in data mining and machine learning <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14]</ref>. A decision tree can be considered a tree representation of a function over attributes which takes a finite number of values, called class labels. The function is partially defined by a set of vectors (objects) of attribute values and the assigned class labels, usually depicted by a table. An example function is depicted in Fig. <ref type="figure">1</ref>. The goal is to construct a tree that approximates the function with a desired accuracy. This is called decision tree induction. An induced decision tree is typically used for the classification of objects into classes, based on the objects' attribute values. A good decision tree is supposed to classify well both the objects described by the input data table and "unseen" objects.</p><p>Each non-leaf node of a decision tree is labeled by an attribute, called the splitting attribute for this node. Such a node represents a test according to which the objects covered by the node are split into v subcollections which correspond to the v possible outcomes of the test. In the basic setting, the outcomes are represented by the values of the splitting attribute. Leaf nodes of the tree represent collections of objects all of which, or the majority of which, have the same class label. An example of a decision tree is depicted in Fig. <ref type="figure" target="#fig_1">4</ref>.</p><p>Many algorithms for the construction of decision trees have been proposed in the literature, see e.g. <ref type="bibr" target="#b13">[14]</ref>. A commonly used strategy consists of constructing a decision tree recursively in a top-down fashion, from the root node to the leaves, by successively splitting existing nodes into child nodes based on the splitting attribute. 
A critical point in this strategy is the selection of the splitting attribute in each node, for which many approaches have been proposed. These include the well-known approaches based on entropy measures, the Gini index, classification error, or other measures defined in terms of the class distribution of the objects before and after splitting, see <ref type="bibr" target="#b13">[14]</ref> for an overview.</p><p>Remark 1. In machine learning, and in decision trees in particular, the input data attributes are very often categorical attributes. To utilize FCA on the input data, we need to transform the categorical attributes to binary attributes because, in its basic setting, FCA works with binary attributes. A transformation of input data which consists in replacing non-binary attributes by binary ones is called conceptual scaling in FCA <ref type="bibr" target="#b3">[4]</ref>. Note that we need not transform the class attribute, i.e. the attribute determining to which class an object belongs, because our data preprocessing method transforms the input attributes only.</p><p>Throughout this paper, we use the input data from Fig. <ref type="figure">1</ref> (top) to illustrate the data preprocessing. The data table contains sample animals described by the attributes body temperature, gives birth, fourlegged, hibernates, and mammal, with the last attribute being the class. After an obvious transformation (nominal scaling) of the input attributes, we obtain the data depicted in Fig. <ref type="figure">1 (bottom)</ref>. The boolean factor analysis which we use in our method is applied to the data obtained after such a transformation. For illustration, the decision tree induced from the data is depicted in Fig. <ref type="figure" target="#fig_1">4</ref> (left).</p></div>
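The entropy-based selection of splitting attributes mentioned above can be sketched as follows. This is an illustrative ID3-style information-gain computation (the names `entropy` and `information_gain` are ours, not the paper's), applied to the binary table of Fig. 1 (bottom).

```python
# Entropy-based splitting-attribute selection (information gain, ID3-style).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction after splitting on the attribute at index `attr`."""
    n = len(rows)
    groups = {}
    for row, lab in zip(rows, labels):
        groups.setdefault(row[attr], []).append(lab)
    after = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - after

# The binary table from Fig. 1 (bottom); class attribute = mammal
rows = [(0, 1, 0, 1, 0, 1, 1, 0),   # cat
        (0, 1, 0, 1, 1, 0, 0, 1),   # bat
        (1, 0, 1, 0, 0, 1, 0, 1),   # salamander
        (0, 1, 1, 0, 1, 0, 1, 0),   # eagle
        (1, 0, 0, 1, 1, 0, 1, 0)]   # guppy
labels = ['yes', 'yes', 'no', 'no', 'no']

best = max(range(8), key=lambda a: information_gain(rows, labels, a))
# the body-temperature and gives-birth attributes (indices 0-3) tie for best
```

On this table the body temperature and gives birth attributes tie: each alone still leaves one impure branch, which is why the factor combining them (see section 3.2) splits better.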
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Extending the collection of attributes</head><p>The first approach proposed in our data preprocessing method is the extension of the collection of attributes by new attributes created using boolean factor analysis. In particular, the new attributes are represented by the factors obtained from the decomposition of the input data table.</p><p>Let I ⊆ X × Y be an input data table describing objects X = {x_1, . . . , x_n} by binary attributes Y = {y_1, . . . , y_m}. Considering I as an n × m binary matrix, we find a decomposition I = A • B of I into an n × k matrix A describing objects by factors F = {f_1, . . . , f_k} and a k × m matrix B explaining the factors F by the attributes. The decomposition of the example data table in Fig. <ref type="figure">1</ref> is depicted in Fig. <ref type="figure" target="#fig_0">2</ref>. The new collection of attributes Y′ is then defined to be Y′ = Y ∪ F and the extended data table</p><formula xml:id="formula_9">I′ ⊆ X × Y′ is defined by I′ ∩ (X × Y) = I and I′ ∩ (X × F) = A.</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Hence the new collection of attributes is the union of the original attributes and the factors, and the extended data table is the apposition of the original data table and the table representing the matrix describing objects by factors. Fig. 3 depicts the extended data table.</p><formula xml:id="formula_10">
⎛ 0 1 0 1 0 1 1 0 ⎞   ⎛ 0 0 1 0 0 1 ⎞
⎜ 0 1 0 1 1 0 0 1 ⎟   ⎜ 0 0 1 0 1 0 ⎟
⎜ 1 0 1 0 0 1 0 1 ⎟ = ⎜ 0 0 0 1 0 0 ⎟ •
⎜ 0 1 1 0 1 0 1 0 ⎟   ⎜ 1 0 0 0 0 0 ⎟
⎝ 1 0 0 1 1 0 1 0 ⎠   ⎝ 0 1 0 0 0 0 ⎠
</formula><formula xml:id="formula_11">
⎛ 0 1 1 0 1 0 1 0 ⎞
⎜ 1 0 0 1 1 0 1 0 ⎟
⎜ 0 1 0 1 0 0 0 0 ⎟
⎜ 1 0 1 0 0 1 0 1 ⎟
⎜ 0 1 0 1 1 0 0 1 ⎟
⎝ 0 1 0 1 0 1 1 0 ⎠
</formula><p>Fig. 2. Boolean matrix decomposition of input data in Fig. <ref type="figure">1</ref></p><p>
Name        bc bw gn gy fn fy hn hy f1 f2 f3 f4 f5 f6 mammal
cat          0  1  0  1  0  1  1  0  0  0  1  0  0  1 yes
bat          0  1  0  1  1  0  0  1  0  0  1  0  1  0 yes
salamander   1  0  1  0  0  1  0  1  0  0  0  1  0  0 no
eagle        0  1  1  0  1  0  1  0  1  0  0  0  0  0 no
guppy        1  0  0  1  1  0  1  0  0  1  0  0  0  0 no
</p><p>Fig. <ref type="figure">3</ref>. Extended data table for input data in Fig. <ref type="figure">1</ref></p><p>The key part is the decomposition of the original data table. In the decomposition of binary matrices the aim is to find a decomposition with the number of factors as small as possible. However, since the factors, as new attributes, are used in the process of decision tree induction in our application, we are also looking for factors which have a good "decision ability", i.e. factors which are good candidates to be splitting attributes. To compute the decomposition we can use the algorithms presented in <ref type="bibr" target="#b0">[1]</ref>, with a modified criterion of optimality of the computed factors. In short, the algorithms apply a greedy heuristic to search the space of all formal concepts for the factor concepts which cover the largest area of still-uncovered 1s in the input data table. 
The criterion function of optimality of a factor is thus the "cover ability" of the corresponding factor concept, in particular the number of so-far-uncovered 1s in the input data table which are covered by the concept, see <ref type="bibr" target="#b0">[1]</ref>. The function value is, for the purposes of this paper, translated to the interval [0, 1] (with the value 1 being optimal) by dividing it by the total number of still uncovered 1s in the data table.</p><p>The new criterion function c : 2^{X×Y} → [0, 1] of optimality of a factor concept ⟨A, B⟩ is:</p><formula xml:id="formula_12">c(⟨A, B⟩) = w • c_A(⟨A, B⟩) + (1 − w) • c_B(⟨A, B⟩),<label>(3)</label></formula><p>where c_A(⟨A, B⟩) ∈ [0, 1] is the original criterion function of the "cover ability" of the factor concept ⟨A, B⟩, c_B(⟨A, B⟩) ∈ [0, 1] is a criterion function of the "decision ability" of the factor concept ⟨A, B⟩, and w is a weight expressing the preference between the functions c_A and c_B. Let us focus on the function c_B. The function measures the goodness of the factor, defined by the factor concept, as a splitting attribute. As mentioned in section 3.1, common approaches to the selection of the splitting attribute in decision trees are based on entropy measures. In these approaches, an attribute is a better splitting attribute the lower the weighted sum of entropies of the subcollections of objects after splitting on the attribute is. 
We thus design the function c_B to be such a measure:</p><formula xml:id="formula_13">c_B(⟨A, B⟩) = 1 − ( (|A|/|X|) • E(class|A) / (− log_2 1/|V(class|A)|) + (|X \ A|/|X|) • E(class|X \ A) / (− log_2 1/|V(class|X \ A)|) ),<label>(4)</label></formula><p>where V(class|A) is the set of class labels assigned to objects A and E(class|A) is the entropy of objects A based on the class, defined as usual by:</p><formula xml:id="formula_14">E(class|A) = − Σ_{l∈V(class|A)} p(l|A) • log_2 p(l|A),<label>(5)</label></formula><p>where p(l|A) is the fraction of objects in A with assigned class label l. The value of − log_2 1/|V(class|A)| in (<ref type="formula" target="#formula_13">4</ref>) is the maximal possible value of the entropy of objects A, attained when the class labels V(class|A) are assigned to objects A evenly; its purpose is to normalize the value of c_B to the interval [0, 1]. Note that we put 0/0 = 0 in the calculations in (4). Now, having the extended data table I′ ⊆ X × (Y ∪ F) containing the new attributes F, the decision tree is induced from the extended data table instead of the original data table I. The class labels assigned to objects remain unchanged, see Fig. <ref type="figure">3</ref>. For illustration, the decision tree induced from the data table in Fig. <ref type="figure">3</ref> is depicted in Fig. <ref type="figure" target="#fig_1">4</ref> (right). We can see that the data can be decided by a single attribute, namely factor f3, the manifestations of which are the original attributes bt warm and gb yes. Factor f3, as the combination of the two attributes, is a better splitting attribute in decision tree induction than either of the two attributes alone. The resulting decision tree is used as follows. When classifying an object x described by the original attributes Y as a vector P_x ∈ {0, 1}^m in the (original) attribute space, we first need to compute the description of the object by the new attributes/factors F as a vector g(P_x) ∈ {0, 1}^k in the factor space. 
This is accomplished by (1), using the matrix B explaining factors in terms of the original attributes. The object described by the concatenation of P_x and g(P_x) is then classified by the decision tree in the usual way.</p><p>For instance, an object described by the original attributes Y as the vector (10011010)_Y is described by the factors F as the vector (010000)_F. The object described by the concatenation of these two vectors is classified with class label no by the decision tree in Fig. <ref type="figure" target="#fig_1">4</ref> (right).</p></div>
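The "decision ability" criterion c_B of (4) can be sketched as follows, assuming, as stated in the text, that each part's entropy is normalized by its maximal value −log₂(1/|V|) = log₂|V| and that 0/0 is taken as 0. The function and variable names are illustrative.

```python
# Sketch of the "decision ability" criterion c_B from (4): one minus the
# normalized weighted entropy of the split induced by the factor concept's
# extent (objects covered by the factor vs. objects not covered).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def c_B(extent, X, label):
    """extent: covered objects; X: all objects; label: object -> class."""
    def part(objs):
        labs = [label(x) for x in objs]
        v = len(set(labs))
        if not labs or v <= 1:
            return 0.0                 # pure or empty part: 0/0 := 0 as in (4)
        return (len(objs) / len(X)) * entropy(labs) / log2(v)
    return 1.0 - (part(extent) + part(X - extent))

# Factor f3 of Fig. 2 covers exactly the two mammals {cat, bat}:
X = {'cat', 'bat', 'salamander', 'eagle', 'guppy'}
label = {'cat': 'yes', 'bat': 'yes', 'salamander': 'no',
         'eagle': 'no', 'guppy': 'no'}.get
print(c_B({'cat', 'bat'}, X, label))   # 1.0 — a perfect splitting attribute
```

Both parts of the split induced by f3 are pure, so c_B reaches its maximum of 1, matching the observation that f3 alone decides the example data.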
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Reducing the collection of attributes</head><p>The second approach consists in the replacement of the original attributes by factors, i.e. discarding the original data table. Hence the new collection of attributes Y′ is defined to be Y′ = F and the new data table</p><formula xml:id="formula_16">I′ ⊆ X × Y′ is defined as I′ = A,</formula><p>where A is the n × k binary matrix describing objects by factors resulting from the decomposition I = A • B of the input data table I. Hence the new, reduced data table for the example data in Fig. <ref type="figure">1</ref> is the table depicted in Fig. <ref type="figure">3</ref> restricted to attributes f1, . . . , f6.</p><p>Since the number of factors is usually smaller than the number of attributes, see <ref type="bibr" target="#b0">[1]</ref>, this transformation usually leads to a reduction of the dimensionality of the data. However, the transformation of objects from the attribute space to the factor space is not an injective mapping. In particular, the mapping g from attribute vectors to factor vectors maps large convex sets of objects to the same points in the factor space, see <ref type="bibr" target="#b0">[1]</ref> for details. Namely, it can happen that two distinct objects x1, x2 ∈ X described by different vectors in the space of attributes, P_x1 ≠ P_x2, and with different assigned class labels, class(x1) ≠ class(x2), are represented by the same vector in the factor space, g(P_x1) = g(P_x2).</p><p>Consider the relation ker(g) (the kernel relation of g) describing such a situation. The class [x]_ker(g) ∈ X/ker(g) for an object x ∈ X contains the objects represented in the (original) attribute space which are mapped to the same representation as x in the factor space. 
The class label assigned to each object x ∈ X in the new data table I′ is the majority class label for the class [x]_ker(g) ∈ X/ker(g), defined as follows: a class label l is a majority class label for [x]_ker(g) if l is assigned to most of the objects from [x]_ker(g), i.e. if l = class(x1) for x1 ∈ [x]_ker(g) such that for each x′ ∈ [x]_ker(g) it holds:</p><formula xml:id="formula_17">|{x2 ∈ [x]_ker(g) | class(x2) = l}| ≥ |{x2 ∈ [x]_ker(g) | class(x2) = class(x′)}|.</formula><p>Finally, the decision tree is induced from the transformed data table I′ ⊆ X × F, where the class label assigned to each object x ∈ X is the majority class label for the class [x]_ker(g) ∈ X/ker(g). Similarly to the first approach in section 3.2, when classifying an object x described by the original attributes Y as a vector P_x ∈ {0, 1}^m in the (original) attribute space, we first compute the description of the object by the factors F as a vector g(P_x) ∈ {0, 1}^k in the factor space. The object described by g(P_x) is then classified by the decision tree. In our example, the decision tree induced from the reduced data table (the table in Fig. <ref type="figure">3</ref> restricted to attributes f1, . . . , f6) is the same as the tree induced from the extended data table, i.e. the tree depicted in Fig. <ref type="figure" target="#fig_1">4</ref> (right).</p></div>
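The majority relabeling over the classes of ker(g) can be sketched as follows; `majority_relabel` and its toy inputs are illustrative names, not from the paper.

```python
# Majority class labels over the kernel classes of g: objects with the same
# factor-space image form one class [x]_ker(g), and every object in a class
# receives the most frequent class label within that class.
from collections import Counter, defaultdict

def majority_relabel(objects, g_image, label):
    """objects: object ids; g_image: object -> factor vector (hashable);
    label: object -> class label. Returns the new object -> label mapping."""
    classes = defaultdict(list)          # group objects by their g image
    for x in objects:
        classes[g_image(x)].append(x)
    new_label = {}
    for members in classes.values():
        majority = Counter(label(x) for x in members).most_common(1)[0][0]
        for x in members:
            new_label[x] = majority
    return new_label

# Toy example: o1 and o2 share a factor image, o3 has its own
objs = ['o1', 'o2', 'o3']
img = {'o1': (1, 0), 'o2': (1, 0), 'o3': (0, 1)}.get
lab = {'o1': 'yes', 'o2': 'yes', 'o3': 'no'}.get
print(majority_relabel(objs, img, lab))
# {'o1': 'yes', 'o2': 'yes', 'o3': 'no'}
```

If o2 had the label no instead, the kernel class {o1, o2} would be tied and a tie-breaking policy would be needed; the definition above only requires the chosen label to be at least as frequent as any other.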
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluation</head><p>We performed a series of experiments to evaluate our data preprocessing method. The experiments consist in comparing the performance of machine learning models (here, decision trees) induced from original and from preprocessed input data. In the comparison we used the reference decision tree algorithms ID3 and C4.5 <ref type="bibr" target="#b12">[13]</ref> (entropy and information gain based) and also an instance-based learning method (IB1). The algorithms were run from Weka<ref type="foot" target="#foot_1">1</ref>, a software package that contains implementations of machine learning and data mining algorithms in Java. Weka's default parameters were used for the algorithms. The experiments were done on selected public real-world datasets from the UCI Machine Learning Repository. The selected datasets come from different areas (medicine, biology, zoology, politics, games). All the datasets contain only categorical attributes with one class label attribute, and the datasets were cleared of objects containing missing values. Basic characteristics of the datasets are depicted in Tab. 1. The numbers of attributes given are for the original categorical attributes and, in brackets, for the binary attributes after nominal scaling (see remark 1). The experiments were done using 10-fold stratified cross-validation. The following results are averages over 10 execution runs on each dataset with randomly ordered records.</p><p>Due to the limited scope of the paper we show only the results of data preprocessing by reducing the original attributes to factors; the results for adding the factors to the collection of attributes are postponed to the full version of the paper. The results are depicted in Tab. 2. The tables show ratios of the average percentage rates of correct classifications for preprocessed data and original data, i.e. 
the values indicate the factor of increase in correct classifications for preprocessed data. The values are given for both the training (upper number in the table cell) and testing (lower number) datasets for each compared algorithm and dataset, plus the average over all datasets. In the case of the top table, the criterion of optimality of the generated factors (3) was set to the original criterion function of the "cover ability" of the factor concept, i.e. the original criterion used in the algorithms from <ref type="bibr" target="#b0">[1]</ref>. This corresponds to setting w = 1 in (3). In the case of the bottom table, the criterion of optimality of the generated factors was changed to the function of the "decision ability" described in section 3.2, i.e. w = 0 in (3). We can see that, while not inducing a worse learning model on average on the training datasets, the methods perform better on average on the testing datasets for input data preprocessed by our methods (with the exception of the dataset zoo, which has more than two values of the class attribute). For instance, the ID3 method performs better by 3.8 % (5.4 % without zoo) when the criterion of optimality of the generated factors is the original criterion function of the "cover ability" of the factor concept, while for the criterion being the function of the "decision ability" the performance is better by 5.1 % (6.5 % without zoo). The results for adding the factors to the collection of attributes are very similar, within ±1 % of the results for reducing the original attributes to factors, with the exception of the dataset zoo, where the difference was +4 %.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>We presented two methods of preprocessing input data for machine learning based on formal concept analysis (FCA). In the first method, the collection of attributes describing objects is extended by new attributes, while in the second method, the original attributes are replaced by the new attributes. Both methods utilize boolean factor analysis, recently described by FCA, in that the new attributes are defined as factors computed from the input data. The number of factors is usually smaller than the number of original attributes. The methods were demonstrated on the induction of decision trees, and an experimental evaluation indicates the usefulness of such preprocessing of data: the decision trees induced from preprocessed data outperformed the decision trees induced from original data for the two entropy-based methods ID3 and C4.5.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Theorem 2 .</head><label>2</label><figDesc>Let I = A • B for n × k and k × m binary matrices A and B. Then there exists a set F ⊆ B(X, Y, I) of formal concepts of I with |F| ≤ k such that for the n × |F| and |F| × m binary matrices A_F and B_F we have</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 4 .</head><label>4</label><figDesc>Fig.<ref type="bibr" target="#b3">4</ref>. The decision trees induced from original data table in Tab. 1 (left) and from extended data table in Tab.3 (right)    </figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>(left).</figDesc><table><row><cell cols="2">Name</cell><cell cols="7">body temp. gives birth fourlegged hibernates mammal</cell><cell></cell></row><row><cell cols="2">cat</cell><cell>warm</cell><cell>yes</cell><cell></cell><cell>yes</cell><cell></cell><cell>no</cell><cell>yes</cell><cell></cell></row><row><cell cols="2">bat</cell><cell>warm</cell><cell>yes</cell><cell></cell><cell>no</cell><cell></cell><cell>yes</cell><cell>yes</cell><cell></cell></row><row><cell cols="2">salamander</cell><cell>cold</cell><cell>no</cell><cell></cell><cell>yes</cell><cell></cell><cell>yes</cell><cell>no</cell><cell></cell></row><row><cell cols="2">eagle</cell><cell>warm</cell><cell>no</cell><cell></cell><cell>no</cell><cell></cell><cell>no</cell><cell>no</cell><cell></cell></row><row><cell cols="2">guppy</cell><cell>cold</cell><cell>yes</cell><cell></cell><cell>no</cell><cell></cell><cell>no</cell><cell>no</cell><cell></cell></row><row><cell>Name</cell><cell cols="9">bt cold bt warm gb no gb yes fl no fl yes hb no hb yes mammal</cell></row><row><cell>cat</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>yes</cell></row><row><cell>bat</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell><cell>yes</cell></row><row><cell>salamander</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>no</cell></row><row><cell>eagle</cell><cell>0</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>no</cell></row><row><cell>guppy</cell><cell>1</cell><cell>0</cell><cell>0</cell><cell>1</cell><cell>1</cell><cell>0</cell><cell>1</cell><cell>0</cell><cell>no</cell></row></table><note>Fig. 1. 
Input data table (top) and the corresponding data table for FCA (bottom)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 .</head><label>1</label><figDesc>Characteristics of datasets used in experiments</figDesc><table><row><cell>Dataset</cell><cell cols="3">No. of attributes (binary) No. of objects Class distribution</cell></row><row><cell>breast-cancer</cell><cell>9(51)</cell><cell>277</cell><cell>196/81</cell></row><row><cell>kr-vs-kp</cell><cell>36(74)</cell><cell>3196</cell><cell>1669/1527</cell></row><row><cell>mushroom</cell><cell>21(125)</cell><cell>5644</cell><cell>3488/2156</cell></row><row><cell>tic-tac-toe</cell><cell>9(27)</cell><cell>958</cell><cell>626/332</cell></row><row><cell>vote</cell><cell>16(32)</cell><cell>232</cell><cell>124/108</cell></row><row><cell>zoo</cell><cell>15(30)</cell><cell>101</cell><cell>41/20/5/13/4/8/10</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 .</head><label>2</label><figDesc>Classification accuracy for datasets from Tab. 1, for w = 1 (top) and w = 0 (bottom table) in<ref type="bibr" target="#b2">(3)</ref> </figDesc><table><row><cell>training % testing %</cell><cell cols="5">breast-cancer kr-vs-kp mushroom tic-tac-toe vote</cell><cell>zoo</cell><cell>average</cell></row><row><cell>ID3</cell><cell>1.020 1.159</cell><cell>1.000 0.993</cell><cell>1.000 1.000</cell><cell>1.000 1.123</cell><cell>1.000 0.993</cell><cell>1.018 0.962</cell><cell>1.006 1.038</cell></row><row><cell>C4.5</cell><cell>1.031 0.989</cell><cell>0.998 0.994</cell><cell>1.000 1.000</cell><cell>1.028 1.092</cell><cell>0.998 0.994</cell><cell>1.006 0.940</cell><cell>1.010 1.002</cell></row><row><cell>IB1</cell><cell>1.020 0.970</cell><cell>1.000 1.017</cell><cell>1.000 1.000</cell><cell>1.000 1.000</cell><cell>1.000 1.005</cell><cell>1.020 0.965</cell><cell>1.007 0.993</cell></row><row><cell>training % testing %</cell><cell cols="5">breast-cancer kr-vs-kp mushroom tic-tac-toe vote</cell><cell>zoo</cell><cell>average</cell></row><row><cell>ID3</cell><cell>1.020 1.153</cell><cell>1.000 1.000</cell><cell>1.000 1.000</cell><cell>1.000 1.157</cell><cell>1.000 1.017</cell><cell>1.018 0.980</cell><cell>1.006 1.051</cell></row><row><cell>C4.5</cell><cell>1.047 1.035</cell><cell>1.000 0.998</cell><cell>1.000 1.000</cell><cell>1.033 1.138</cell><cell>1.000 1.007</cell><cell>1.006 0.958</cell><cell>1.014 1.023</cell></row><row><cell>IB1</cell><cell>1.020 0.951</cell><cell>1.000 1.083</cell><cell>1.000 1.000</cell><cell>1.000 1.213</cell><cell>1.000 1.033</cell><cell>1.020 0.967</cell><cell>1.007 1.041</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1">Waikato Environment for Knowledge Analysis, available at http://www.cs.waikato.ac.nz/ml/weka/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Discovery of optimal factors in binary data via a novel method of matrix decomposition</title>
		<author>
			<persName><forename type="first">R</forename><surname>Belohlavek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vychodil</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. Comput. System Sci</title>
		<imprint>
			<biblScope unit="volume">76</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="3" to="20" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Classification and Regression Trees</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Olshen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Stone</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1984">1984</date>
			<publisher>Chapman &amp; Hall</publisher>
			<pubPlace>NY</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A comparative study of FCA-based supervised classification algorithms</title>
		<author>
			<persName><forename type="first">H</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Njiwoua</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Mephu Nguifo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICFCA 2004</title>
				<meeting>ICFCA 2004</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">2961</biblScope>
			<biblScope unit="page" from="313" to="320" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Formal Concept Analysis</title>
		<author>
			<persName><forename type="first">B</forename><surname>Ganter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wille</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Mathematical Foundations</title>
				<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Boolean Matrix Theory and Applications</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">H</forename><surname>Kim</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1982">1982</date>
			<publisher>M. Dekker</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Machine learning and formal concept analysis</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">O</forename><surname>Kuznetsov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICFCA 2004</title>
				<meeting>ICFCA 2004</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">2961</biblScope>
			<biblScope unit="page" from="287" to="312" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">A theory and methodology of inductive learning</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">S</forename><surname>Michalski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="111" to="116" />
			<date type="published" when="1983">1983</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">What Can Formal Concept Analysis Do for Data Warehouses?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Missaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kwuida</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICFCA 2009</title>
				<meeting>ICFCA 2009</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5548</biblScope>
			<biblScope unit="page" from="58" to="65" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">ID2-of-3: constructive induction of M-of-N concepts for discriminators in decision trees</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Pazzani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Eight Int. Workshop on Machine Learning</title>
				<meeting>of the Eight Int. Workshop on Machine Learning</meeting>
		<imprint>
			<date type="published" when="1991">1991</date>
			<biblScope unit="page" from="183" to="187" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A system for induction of oblique decision trees</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Murthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kasif</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Salzberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">J. of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="1" to="33" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Boolean feature discovery in empirical learning</title>
		<author>
			<persName><forename type="first">G</forename><surname>Pagallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Haussler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="71" to="100" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Iterative feature construction for improving inductive learning algorithms</title>
		<author>
			<persName><forename type="first">S</forename><surname>Piramuthu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">T</forename><surname>Sikora</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="3401" to="3406" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">C4.5: Programs for Machine Learning</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Quinlan</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1993">1993</date>
			<publisher>Morgan Kaufmann</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Introduction to Data Mining</title>
		<author>
			<persName><forename type="first">P.-N</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Steinbach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kumar</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2006">2006</date>
			<publisher>Addison Wesley</publisher>
			<pubPlace>Boston, MA</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Formal concept analysis for knowledge discovery and data mining: The new challenges</title>
		<author>
			<persName><forename type="first">P</forename><surname>Valtchev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Missaoui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Godin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ICFCA 2004</title>
				<meeting>ICFCA 2004</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">2961</biblScope>
			<biblScope unit="page" from="352" to="371" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
