=Paper=
{{Paper
|id=Vol-2644/paper11
|storemode=property
|title=Rule Extraction via Dynamic Discretization with an Application to Air Quality Modelling
|pdfUrl=https://ceur-ws.org/Vol-2644/paper11.pdf
|volume=Vol-2644
|authors=Joanna Kaminska,Estrella Lucena-Sanchez,Guido Sciavicco,Ionel Eduard Stan
|dblpUrl=https://dblp.org/rec/conf/ruleml/KaminskaLSS20
}}
==Rule Extraction via Dynamic Discretization with an Application to Air Quality Modelling==
Rule Extraction via Dynamic Discretization with an Application to Air Quality Modelling*

J. Kamińska1 [0000−0002−0157−516X], E. Lucena-Sánchez2,3 [0000−0001−9312−1175], G. Sciavicco4 [0000−0002−9221−879X], and I.E. Stan5,6 [0000−0001−9260−102X]

1 Dept. of Math., Wroclaw University of Environmental and Life Sciences, joanna.kaminska@upwr.edu.pl
2 Dept. of Phy., Inf., and Math., University of Modena e Reggio Emilia
3 Dept. of Math. and Comp. Sci., University of Ferrara, estrella.lucenasanchez@unife.it
4 Dept. of Math. and Comp. Sci., University of Ferrara, guido.sciavicco@unife.it
5 Dept. of Math., Phy., and Comp. Sci., University of Parma
6 Dept. of Math. and Comp. Sci., University of Ferrara, ioneleduard.stan@unife.it

* This research has been partially supported by the Italian INDAM GNCS project Strategic Reasoning and Automatic Synthesis of Multi-Agent Systems.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Association rule extraction is a well-known and important problem in machine learning, and especially in the sub-field of explainable machine learning. Association rules are naturally extracted from data sets with Boolean (or at least categorical) attributes. In order for rule extraction algorithms to be applicable to data sets with numerical attributes as well, the data must be suitably discretized, and a great amount of work has been devoted to finding good discretization algorithms, taking into account that optimal discretization is an NP-hard problem. Motivated by a specific application, in this paper we provide a novel discretization algorithm, defined as a (heuristic) optimization problem and solved by an evolutionary algorithm, and we test its performance against well-known available solutions, showing experimentally that we are able to extract more rules in an easier way.

Keywords: Association rule extraction · Optimization problem · Evolutionary algorithm.

1 Introduction

Among the most relevant problems in machine learning are the classification problem and the association rule extraction problem. An association rule is an if-then statement that helps to show the probability of a certain relationship among the attributes of a data set. While classification is a global problem (e.g., a logistic regression model classifies all instances), association rule extraction is a local problem, which makes it possible for unusual, uncommon relationships to emerge. The first, and most successful, rule extraction algorithm is Agrawal and Srikant's APRIORI [1], originally designed to extract, with a deterministic approach, meaningful rules from a data set of Boolean attributes. APRIORI has been later generalized to deal with numerical attributes (approaching, in a sense, the discretization problem) in [3], and to extract temporal rules (on a very limited basis) in [2]. The quality of association rules is generally measured by their support (that is, their relevance within the data set) and their confidence (that is, their precision); depending on the particular application, though, other, more specific quality measures can be introduced. The approaches to the discretization (or binning) problem, or the problem of mining quantitative association rules, can be categorized into deterministic (greedy) and non-deterministic (usually based on metaheuristics).
Deterministic static approaches simply cut the domain of a continuous attribute into a predefined number of bins before the mining process. Agrawal's solution to discretization can be considered static. Indeed, it is built on a deterministic procedure that partitions each quantitative attribute one-by-one (i.e., without considering the distribution of the other attributes) into equal-width intervals, and on a modification of the original APRIORI algorithm to handle sets of items that are generalizations of others (e.g., saying that the age of someone varies in the interval [20, 29] is more general than saying that it varies in the interval [20, 24]) and to take into account a newly introduced interestingness measure. Moreover, adjacent intervals can be merged to compensate for the low confidence of rules in some cases. Discretization algorithms belonging to this category are available in some open-source learning frameworks; they allow both equal-width and equal-frequency discretization, but are separated from the rule learning phase, which is applied to the resulting, discretized data set: WEKA's discretization filter [22], Python's k-bins discretizer [18], and R's discretization library [19]. Deterministic distance-based discretization is a possible alternative to static methods, and it is based on clustering; examples include [4, 13, 15, 21], among others. Non-deterministic discretization, on the other hand, is based on designing a multivariate optimization problem [14, 20]: rules are extracted, and evaluated on a discretized data set that is progressively modified; the main drawback of such an approach is that the rule extraction algorithm must be re-designed. Finally, most of the existing approaches are general purpose, but in many cases the intended applications are represented by data sets with discrete numerical attributes (e.g., number of cars owned, age, and so on); applying a discretization algorithm for rule extraction from data sets with continuous attributes may require a more specialized approach.

In this paper we consider a data set that represents the measurement of several air quality parameters at a specific monitoring station in the center of Wroclaw (Poland) over a period of three years. Such a data set has been studied in the context of global explanation models (e.g., in [11, 12]), with interesting, but possibly incomplete, results; this is usually the case with real-world data, from which one can often expect to extract a general model that leaves room for particular cases. In this paper we aim to extract meaningful local rules, which may be useful to model the particular situations that occurred during the measurement period. We propose and test a multivariate, non-deterministic, distribution-based discretization algorithm that can be coupled with a pre-existing rule extraction algorithm, such as APRIORI, and implemented via an evolutionary algorithm. We show that our solution produces more rules w.r.t. the particular situations than those produced by the discretization algorithms available in some open-source learning libraries. Finally, we prove that, like most preprocessing tasks, optimal discretization is NP-hard, which indicates that heuristic-based approaches, such as ours, are to be preferred; this result, shown in the Appendix, is not necessary for the understanding of the rest of the paper.

The paper is organized as follows.
Section 2 presents the needed preliminaries, Section 3 formalizes the dynamic discretization process as an optimization problem together with implementation details, Section 4 discusses the data and the experimental setup, and, finally, Section 5 discusses the results of the proposed method before concluding in Section 6.

2 Background

Discretization. The discretization of the numerical attributes of a data set is a preprocessing phase, much like feature selection or outlier detection. Preprocessing approaches, which are designed to enable or improve a data mining process on a data set, are classically separated into filters, wrappers, and embedded algorithms. A filter, which is by far the most common type, is an algorithm that uses statistical indicators to perform the preprocessing; it is independent of the data mining or model extraction phase, which is applied to the resulting data set. A wrapper is a combined approach, in which the result of the preprocessing phase is continuously evaluated using the performance of a selected data mining algorithm, in search of an optimal solution. Embedded algorithms for preprocessing are integrated data mining algorithms, in which the preprocessing is part of the mining process; by nature, embedded algorithms are more difficult to implement, and therefore less common. The most relevant examples of discretization filters are included in some open-source learning frameworks, namely WEKA's discretization filter [22], Python's k-bins discretizer [18], and R's discretization library [19]; the differences among them are mainly related to their implementation, and they tend to behave in a very similar way. An unsupervised discretization filter usually offers two modalities, namely equal-width bins (that is, bins whose describing intervals all have the same width) and equal-frequency bins (that is, bins that contain roughly the same number of instances). Attribute preprocessing algorithms, such as discretization algorithms, can also be separated into univariate (in which attributes are considered one-by-one) and multivariate (in which the decisions are taken for subsets of attributes all together); the most common filters, such as those mentioned above, are all univariate. Among the proposed multivariate filters for discretization, distance-based discretization has been studied in [4, 13, 15, 21], among others. Since optimal preprocessing is usually a computationally unfeasible problem, the proposed deterministic solutions are generally sub-optimal (greedy), and, sometimes, non-deterministic (i.e., heuristic-based) solutions are advocated; however, filters are typically deterministic. In the current literature, non-deterministic solutions for discretization are usually embedded algorithms, in which, as we have recalled, the rule-extraction phase is re-designed to include discretization, as in [3, 14, 17, 20]. Thus, summarizing, discretization filters are mostly univariate and greedy, while embedded discretization algorithms require the implementation of an ad-hoc rule extraction method.
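As a concrete illustration of the two modalities offered by a static filter (the example is ours, not taken from the paper), the following Python sketch applies scikit-learn's k-bins discretizer [18] to a synthetic, right-skewed attribute; the `uniform` strategy corresponds to equal-width binning and the `quantile` strategy to equal-frequency binning:

```python
# Equal-width vs. equal-frequency static discretization with scikit-learn.
# The data here is synthetic and only serves to show the two strategies.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.exponential(scale=30.0, size=(1000, 1))  # a right-skewed attribute

for strategy in ("uniform", "quantile"):  # equal-width / equal-frequency
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = disc.fit_transform(x)        # ordinal bin index per instance
    print(strategy, "edges:", np.round(disc.bin_edges_[0], 2),
          "sizes:", np.bincount(labels.astype(int).ravel()))
```

On skewed data the two strategies return very different cut points, which is precisely the issue discussed in Section 5.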
Multiobjective optimization. A multiobjective optimization problem (see, e.g., [6]) can be formally defined as the problem of simultaneously minimizing (or maximizing) a set of t arbitrary functions:

\[
\begin{aligned}
\min/\max\ & f_1(\bar{x})\\
\min/\max\ & f_2(\bar{x})\\
& \ \ \vdots\\
\min/\max\ & f_t(\bar{x}),
\end{aligned}
\tag{1}
\]

where x̄ is a vector of decision variables. A multiobjective optimization problem can be continuous, in which we look for real values, or combinatorial, in which we look for objects from a countably (in)finite set, typically integers, permutations, or graphs. Maximization and minimization problems can be reduced to each other, so it is sufficient to consider one type only. A set F of solutions for a multiobjective problem is non-dominated (or Pareto optimal) if and only if for each x̄ ∈ F there exists no ȳ ∈ F such that (i) there exists i (1 ≤ i ≤ t) such that fi(ȳ) improves over fi(x̄), and (ii) for every j (1 ≤ j ≤ t, j ≠ i), fj(x̄) does not improve over fj(ȳ). In other words, a solution x̄ dominates a solution ȳ if and only if x̄ is better than ȳ in at least one objective and not worse than ȳ in the remaining objectives. We say that x̄ is non-dominated if and only if there is no other solution that dominates it. The set of non-dominated solutions from F is called the Pareto front. Optimization problems can be approached in several ways; among them, multiobjective evolutionary algorithms are a popular choice (see, e.g., [9, 10, 16]). Wrapper preprocessing algorithms can be easily defined as multiobjective optimization problems, as is the case, for example, for feature selection; in this paper we show how discretization for rule extraction can follow the same pattern.

3 Dynamic Discretization

Rule extraction. For a given categorical data set A, with categorical attributes A = {A1, . . . , An}, an association rule ρ is an object of the type

\[
\rho: \gamma_1 \wedge \ldots \wedge \gamma_s \Rightarrow \gamma_{s+1},
\]

where s < n and each γi belongs to the domain of Ai. As defined by Agrawal and Srikant [1], a rule such as ρ can be evaluated by its support, that is, the fraction of the data set in which every instance is characterized by having both γ1, . . . , γs (antecedent) and γs+1 (consequent), and by its confidence, that is, the ratio between the support and the fraction of instances with γ1, . . . , γs. So, the greater the support, the more influential the rule is in the data set, and the greater the confidence, the greater its precision; in general, we search for meaningful rules (e.g., particular situations) with a high confidence. Discretizing a numerical data set allows one to extract association rules from it, as it converts numerical attributes into categorical ones. In the following, we refer to the categories of a categorical attribute as labels.
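To make the two measures concrete, here is a small Python sketch (ours, with hypothetical attribute names and toy data, not the paper's code) that evaluates the support and confidence of a rule over a categorical data set:

```python
# Support and confidence of an association rule on a categorical data set.
# The attribute names and the toy data below are purely illustrative.
def support_confidence(data, antecedent, consequent):
    """data: list of dicts mapping attribute -> label;
    antecedent, consequent: dicts of attribute -> required label."""
    matches_ant = [row for row in data
                   if all(row.get(a) == v for a, v in antecedent.items())]
    matches_all = [row for row in matches_ant
                   if all(row.get(a) == v for a, v in consequent.items())]
    support = len(matches_all) / len(data)   # fraction of all instances
    confidence = len(matches_all) / len(matches_ant) if matches_ant else 0.0
    return support, confidence

data = [{"wind": "high", "traffic": "high", "no2": "low"},
        {"wind": "high", "traffic": "high", "no2": "low"},
        {"wind": "low",  "traffic": "high", "no2": "high"}]
print(support_confidence(data, {"wind": "high", "traffic": "high"},
                         {"no2": "low"}))   # -> (0.666..., 1.0)
```

The rule holds in two of the three instances (support 2/3) and in every instance that satisfies the antecedent (confidence 1).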
Binning. Given a numerical attribute, the first step of dynamic discretization is defining a parametric function that returns the bins. Thus, let now A be a numerical data set with attributes A = {A1, . . . , An}:

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n}\\
a_{21} & a_{22} & \dots & a_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{pmatrix}
\]

For a variable Ai, we denote by µi its mean in A and by σi its standard deviation; moreover, let ki ∈ (0, 1] be a displacement parameter. For a given number li of expected bins in which we want to discretize the domain of Ai, let γi1, . . . , γili be the corresponding labels. Now, let v(Ai) be a certain value of the variable Ai; by writing v(Ai) ∈ γip we indicate that the interval/bin denoted by the label γip includes the value v(Ai). Then, we set that v(Ai) ∈ γip if:

\[
\begin{cases}
v(A_i) < \mu_i - \lfloor l_i/2\rfloor k_i\sigma_i & \text{if } p = 1,\\
\mu_i - (\lceil l_i/2\rceil - p + 1)k_i\sigma_i \le v(A_i) < \mu_i - (\lceil l_i/2\rceil - p)k_i\sigma_i & \text{if } 1 < p < \lceil l_i/2\rceil,\\
\mu_i - k_i\sigma_i \le v(A_i) \le \mu_i + k_i\sigma_i & \text{if } p = \lceil l_i/2\rceil,\\
\mu_i + (p - \lceil l_i/2\rceil)k_i\sigma_i < v(A_i) \le \mu_i + (p - \lceil l_i/2\rceil + 1)k_i\sigma_i & \text{if } \lceil l_i/2\rceil < p < l_i,\\
\mu_i + \lfloor l_i/2\rfloor k_i\sigma_i < v(A_i) & \text{if } p = l_i
\end{cases}
\tag{2}
\]

for an odd li, and if:

\[
\begin{cases}
v(A_i) \le \mu_i - (l_i/2 - 1)k_i\sigma_i & \text{if } p = 1,\\
\mu_i - (l_i/2 - p + 1)k_i\sigma_i < v(A_i) \le \mu_i - (l_i/2 - p)k_i\sigma_i & \text{if } 1 < p < l_i/2,\\
\mu_i - k_i\sigma_i < v(A_i) \le \mu_i & \text{if } p = l_i/2,\\
\mu_i < v(A_i) \le \mu_i + k_i\sigma_i & \text{if } p = l_i/2 + 1,\\
\mu_i + (p - l_i/2 - 1)k_i\sigma_i < v(A_i) \le \mu_i + (p - l_i/2)k_i\sigma_i & \text{if } l_i/2 + 1 < p < l_i,\\
\mu_i + (l_i/2 - 1)k_i\sigma_i < v(A_i) & \text{if } p = l_i
\end{cases}
\tag{3}
\]

for an even li. In Figure 1 (top) we represent the transformation graphically for given µi, σi, and ki, in the odd case with li = 5: the bin endpoints fall at µi − 2kiσi, µi − kiσi, µi + kiσi, and µi + 2kiσi. Intuitively, ki represents how far an endpoint of a bin is from the mean (µi) in terms of standard deviations (σi).

[Figure 1: top, the domain of a variable cut at µi − 2kiσi, µi − kiσi, µi + kiσi, µi + 2kiσi; bottom, a solution encoded as k1 l1 . . . ki li . . . kn ln, with optimization parameters ki ∈ (0, 1] ⊂ R (displacement) and li ∈ [2, . . . , maxLabels] ⊂ N (number of labels).]
Fig. 1. Graphical representation of the discretization function (2) with li = 5 for some variable (top) and its encoding in the evolutionary algorithm for all variables (bottom).

Applying the transformation to the original numerical data set clearly leads to a categorical one from which rules can be extracted:

\[
A' = \begin{pmatrix}
\gamma_{11} & \gamma_{12} & \dots & \gamma_{1n}\\
\gamma_{21} & \gamma_{22} & \dots & \gamma_{2n}\\
\vdots & \vdots & \ddots & \vdots\\
\gamma_{m1} & \gamma_{m2} & \dots & \gamma_{mn}
\end{pmatrix}
\]

For variables following the normal (Gaussian) distribution, this algorithm is a simple generalization of the common equal-width binning (with a particular configuration for li and ki); yet, as we shall see, for non-normally distributed variables our technique often returns more significant bins.
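The following Python sketch is a direct transcription of functions (2) and (3) as reconstructed above (bins are indexed p = 1, . . . , l); it is illustrative code, not the authors' implementation, and the demo values are simply the NO2 mean and standard deviation from Table 1:

```python
def bin_index(v, mu, sigma, k, l):
    """Return the bin p (1-based) of value v under functions (2)/(3),
    given mean mu, standard deviation sigma, displacement k, and l bins."""
    if l % 2 == 1:                          # odd number of bins: function (2)
        c = l // 2 + 1                      # c = ceil(l/2), the central bin
        if v < mu - (l // 2) * k * sigma:
            return 1
        if v > mu + (l // 2) * k * sigma:
            return l
        if mu - k * sigma <= v <= mu + k * sigma:
            return c
        for p in range(2, c):               # bins strictly left of the center
            if mu - (c - p + 1) * k * sigma <= v < mu - (c - p) * k * sigma:
                return p
        for p in range(c + 1, l):           # bins strictly right of the center
            if mu + (p - c) * k * sigma < v <= mu + (p - c + 1) * k * sigma:
                return p
    else:                                   # even number of bins: function (3)
        h = l // 2                          # the mean separates bins h and h+1
        if v <= mu - (h - 1) * k * sigma:
            return 1
        if v > mu + (h - 1) * k * sigma:
            return l
        for p in range(2, h + 1):           # bins left of (and touching) the mean
            if mu - (h - p + 1) * k * sigma < v <= mu - (h - p) * k * sigma:
                return p
        for p in range(h + 1, l):           # bins right of the mean
            if mu + (p - h - 1) * k * sigma < v <= mu + (p - h) * k * sigma:
                return p
    raise ValueError("unreachable: the cases above cover the whole domain")

# Demo with the NO2 statistics of Table 1 (mu = 50.4, sigma = 23.2):
print(bin_index(49.4, 50.4, 23.2, k=0.5, l=5))   # -> 3 (the central bin)
print(bin_index(231.6, 50.4, 23.2, k=0.5, l=5))  # -> 5 (the rightmost bin)
```

Applying `bin_index` column-by-column with each attribute's (ki, li) yields the categorical data set A′ above.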
Binning as an optimization problem. Our purpose is to devise a good discretization of a numerical data set using our distribution-based technique, that is, to learn a good pair of values (ki, li) for each numerical attribute Ai. Following the taxonomy that we have identified in Section 2, we design our discretization algorithm as a wrapper, where APRIORI is used as a black box. Because of the nature of the classical implementation of APRIORI, support and confidence must be set beforehand; therefore, optimizing these parameters may be less meaningful. On the other hand, we are not just interested in precise rules, but also in rules that are meaningful w.r.t. the specific application domain (e.g., particular situations). As pointed out in [3], discretizing a numerical data set entails a notion of interestingness. In particular, we claim that the fewer the categories (assuming that there are at least two categories per attribute), the more interesting the rules, because, to some extent, having fewer labels makes the discovered rules easier to interpret. Consistently with this observation, let x̄ = [(x11, x21), (x12, x22), . . . , (x1n, x2n)] be a vector of decision variables, each one of which is a pair that represents the displacement ki and the number of bins li of the attribute Ai (cf. Figure 1, bottom). Let L(x̄) denote the collection of all li s of a given vector x̄, and let Maxbins(x̄) (resp., Meanbins(x̄)) be the maximum value (resp., the average) in L(x̄). Also, let Card(x̄) (resp., Meanconf(x̄)) be the number of rules (resp., the average confidence of the rules) that emerge from an application of APRIORI over a numerical data set with n attributes discretized using (2) and (3), for a given minimum support s and minimum confidence c. We can instantiate (1) with four possible combinations of objectives:

\[
\begin{aligned}
&(4)\ \{\min \mathit{Maxbins}(\bar{x}),\ \max \mathit{Card}(\bar{x})\}\\
&(5)\ \{\min \mathit{Meanbins}(\bar{x}),\ \max \mathit{Card}(\bar{x})\}\\
&(6)\ \{\min \mathit{Maxbins}(\bar{x}),\ \max \mathit{Meanconf}(\bar{x})\}\\
&(7)\ \{\min \mathit{Meanbins}(\bar{x}),\ \max \mathit{Meanconf}(\bar{x})\}
\end{aligned}
\]

Observe that minimizing the number of labels favours interpretable rules, while having at least two (non-empty) labels per attribute favours meaningful rules.

Implementation. Multiobjective evolutionary algorithms are known to be particularly suitable for multiobjective optimization, as they search for multiple optimal solutions in parallel. In this experiment we have chosen the well-known NSGA-II (Non-dominated Sorting Genetic Algorithm) [7], which is available as open source from the jMetal suite [8]. NSGA-II is an elitist Pareto-based multiobjective evolutionary algorithm that employs a strategy with binary tournament selection and a rank-crowding better function, where the rank of an individual in a population is the non-domination level of the individual in the whole population. As black-box rule extraction method we used the class Apriori from WEKA. We have represented each solution with an array of pairs (k1, l1), (k2, l2), . . . , (kn, ln) in which, as in Figure 1 (bottom), 2 ≤ li ≤ maxLabels, where maxLabels is a predetermined parameter, and ki ∈ (0, 1]. The initial population is generated randomly. Mutation and crossover were slightly modified to take into account the particular nature of our solutions. In particular, a mutation is performed, randomly with the same probability, over the first or the second component of the chosen attribute: the displacement is either substituted by a random element in (0, 1] or incremented or decremented by 0.1, while the number of labels is either substituted by a random (integer) element in [2, maxLabels] or incremented or decremented by 1. Similarly, a crossover is performed at the level of the pairs (ki, li), that is, given two individuals to be crossed, two random pairs are chosen for each one of them and cross-exchanged. This particular configuration of mutation and crossover is one among many others; the systematic study of other configurations is beyond the scope of this paper.
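To fix ideas, the following sketch shows in plain Python (not through the actual jMetal classes) how the encoding of Figure 1 (bottom) and the modified operators described above can be realized; the value of MAX_LABELS is illustrative, and the crossover reading (one pair exchanged per parent) is one interpretation of the description above:

```python
import random

MAX_LABELS = 5        # the maxLabels parameter (value chosen for illustration)

def random_individual(n):
    """An individual encodes one (k_i, l_i) pair per attribute."""
    return [(random.uniform(1e-6, 1.0), random.randint(2, MAX_LABELS))
            for _ in range(n)]

def mutate(ind):
    """Mutate either k_i or l_i of one randomly chosen attribute."""
    ind = list(ind)
    i = random.randrange(len(ind))
    k, l = ind[i]
    if random.random() < 0.5:                     # act on the displacement
        if random.random() < 0.5:
            k = random.uniform(1e-6, 1.0)         # resample in (0, 1]
        else:
            k = min(1.0, max(1e-6, k + random.choice((-0.1, 0.1))))
    else:                                         # act on the number of labels
        if random.random() < 0.5:
            l = random.randint(2, MAX_LABELS)     # resample in [2, maxLabels]
        else:
            l = min(MAX_LABELS, max(2, l + random.choice((-1, 1))))
    ind[i] = (k, l)
    return ind

def crossover(a, b):
    """Cross-exchange randomly chosen (k_i, l_i) pairs between two parents."""
    a, b = list(a), list(b)
    i, j = random.randrange(len(a)), random.randrange(len(b))
    a[i], b[j] = b[j], a[i]
    return a, b
```

Clamping keeps mutated values inside the admissible ranges ki ∈ (0, 1] and li ∈ [2, maxLabels].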
4 Data and Experimental Setup

Data origin. There is only one communication station for measuring air quality in the city of Wroclaw, and it is located within a wide street with two traffic lanes in each direction (GPS coordinates: 51.086390 North, 17.012076 East). The center of one of the largest intersections in the city, with 14 traffic lanes, is located approximately 30 meters from the measuring station and is covered by traffic monitoring. The station is located 9.6 km from the airport, which lies on the outskirts of the city. Pollution data are collected by the Provincial Environment Protection Inspectorate and encompass the hourly NO2 and NOx concentration values during three full years, from 2015 to 2017. Traffic data are provided by the Traffic Public Transport Management Department of the Roads and City Maintenance Board in Wroclaw, and include hourly counts of all types of vehicles passing the intersection. Public meteorological data are provided by the Institute of Meteorology and Water Management, and they include: air temperature, solar duration, wind speed, relative humidity, and air pressure. In order to have uniform data, solar duration values have been re-normalized to the real interval [0, 1] (as is standard). The full data set contains 26304 observations. In the pre-processing phase, the instances with at least one missing value (617 samples, 2.3%) have been deleted. Some basic statistical indicators on the remaining 25687 instances are presented in Table 1, along with the symbol used in the tests for each variable.

Variable          Unit     Mean   St.Dev.  Min    Median  Max
Air temp. (a)     °C       10.9   8.4      -15.7  10.1    37.7
Solar dur. (d)    h        0.23   0.38     0      0       1
Wind speed (w)    m s⁻¹    3.13   1.95     0      3.00    19
Rel. hum. (r)     %        74.9   17.3     20     79.0    100
Air pressure (p)  hPa      1003   8.5      906    1003    1028
Traffic (t)       -        2771   1795.0   30     3178    6713
NO2               µg m⁻³   50.4   23.2     1.7    49.4    231.6
NOx               µg m⁻³   142.2  103.7    3.9    123.7   1728.0

Table 1. Descriptive statistics of the data.

Unlike previous work on these data (e.g., in [11, 12]), we are interested in extracting rules for particular cases not satisfactorily explained by a general regression model (e.g., sudden spikes in the contaminant concentrations, or unexpectedly low values of the contaminant concentration in the presence of high values of the parameters that are linked to the common causes). In this experiment, we have focused on NO2 values, eliminating the column NOx from the data set; that is, our interest is in rules whose consequent is NO2.

Experimental setup. We have performed one execution for each optimization model (from (4) to (7)) with population size 100 and 1000 total evaluations; our exploratory analysis showed that increasing the number of evaluations does not contribute to improving the results. We have set APRIORI to extract a maximum of 30 rules, with minimal support varying from 0.1 to 0.4 (with a 0.1 step) and minimal confidence varying from 0.7 to 0.85 (with a 0.05 step). Discovering at most 30 rules is purely arbitrary, and may affect the outcome; nonetheless, since one of our goals is to extract more rules, such a parameter suffices when increasing support and/or confidence. Moreover, observe that this choice depends on our particular application, as we are interested in predicting and explaining the amount of contaminant concentration (NO2). Being multiobjective, each execution of a model gives rise to a Pareto front at the end of the last evolution, to which we applied a simple decision-making criterion, that is, we selected the best individual in terms of the second objective (number of rules in models (4) and (5), average confidence of the rules in models (6) and (7)).
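The decision-making step just described is simple enough to be stated in a few lines. In the sketch below (ours), `select_solution` is runnable on a toy front, while `run_nsga2` in the commented loop is a hypothetical stand-in for the actual jMetal + WEKA pipeline, not a real API:

```python
def select_solution(pareto_front):
    """Pick the Pareto-front solution that is best on the second objective
    (number of rules for models (4)-(5), average confidence for (6)-(7)).
    pareto_front: list of (individual, (objective1, objective2)) pairs."""
    return max(pareto_front, key=lambda sol: sol[1][1])

# Toy front: three non-dominated (Maxbins, Card) trade-offs.
front = [("mask_a", (2, 12)), ("mask_b", (3, 19)), ("mask_c", (4, 30))]
print(select_solution(front))  # -> ('mask_c', (4, 30))

# In the experiments one such selection is made for every grid point:
# for s in (0.1, 0.2, 0.3, 0.4):
#     for c in (0.70, 0.75, 0.80, 0.85):
#         front = run_nsga2(data, s, c,             # run_nsga2 is hypothetical
#                           population_size=100, evaluations=1000)
#         best = select_solution(front)
```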
5 Results and Discussion

Results. In Table 2 and Table 3 we show the results of the execution of optimization models (4) and (6). The results of model (5) (resp., (7)) are very similar to those of model (4) (resp., (6)), and are therefore not shown. Model (4) obviously extracts more rules than (6) in every configuration, given that we optimize precisely the number of rules. In each line of both tables, we show: the minimal support and confidence with which APRIORI has been called; the number of rules of the selected final solution (more rules is better); the average length of the rules (shorter rules are better for, e.g., interpretability); the average confidence (more confident rules are better); and a quality index that depends on our particular application, that is, the number of rules whose consequent is NO2 with its highest values. In particular, we are interested in extracting rules that may possibly explain sudden, and not necessarily common, high values of the pollutant; yet, this parameter is merely informative, as it not only depends on the application but its importance is also subject to interpretation. As we can see, in both models we are able to extract progressively fewer rules as the requested minimal support and minimal confidence grow. To each solution is associated a mask, that is, the set of values (k1, l1), (k2, l2), . . . , (kn, ln) that corresponds to the particular discretization; this mask, in turn, gives rise to the values that have been used to decide the bin to which each attribute of each particular instance belongs. As often happens in rule extraction, interpreting the results is not immediate due to the number of extracted rules. While a correct and detailed NO2 contamination model is outside the scope of this paper, from [11, 12] a general principle emerges: to more traffic, and to less wind, correspond higher values of NO2 in the atmosphere. Therefore, in searching for particular cases that may add explanatory value to the general model, we are interested in rules that defy such a general principle, and may therefore indicate that other parameters (inside or outside the available data set) play some role. There are three situations that may be interesting in this sense: higher NO2 with lower traffic; higher NO2 with no traffic (i.e., rules in which the traffic value plays no role); and lower NO2 with high traffic. As it turns out, from the optimization models (4) and (6), rules that belong to the third group do occur; examples of these are in Table 4.

Min. Supp.  Min. Conf.  # Rules  Av. length  Quality ind.  Av. Conf.
0.1         0.70        30       2.96        9             0.76
0.1         0.75        30       2.40        0             0.82
0.1         0.80        29       2.79        0             0.83
0.1         0.85        9        2.44        0             0.88
0.2         0.70        19       1.68        0             0.76
0.2         0.75        9        1.77        0             0.80
0.2         0.80        5        1.88        0             0.83
0.2         0.85        1        2.00        0             0.86
0.3         0.70        4        1           0             0.74
0.3         0.75        2        1.5         0             0.79
0.3         0.80        1        2           0             0.81
0.3         0.85        0        0           0             -
0.4         0.70        1        2           0             0.71
0.4         0.75        1        2           0             0.79
0.4         0.80        0        0           0             -
0.4         0.85        0        0           0             -

Table 2. Results of the execution of the optimization model (4).

Min. Supp.  Min. Conf.  # Rules  Av. length  Quality ind.  Av. Conf.
0.1         0.70        14       2.57        1             0.88
0.1         0.75        4        2           0             0.89
0.1         0.80        2        2.5         0             0.9
0.1         0.85        1        2           0             0.92
0.2         0.70        2        1.5         0             0.85
0.2         0.75        2        1.5         0             0.85
0.2         0.80        2        1.5         0             0.85
0.2         0.85        1        2           0             0.86
0.3         0.70        1        1           0             0.79
0.3         0.75        1        1           0             0.82
0.3         0.80        1        2           0             0.82
0.3         0.85        0        0           0             -
0.4         0.70        1        2           0             0.79
0.4         0.75        1        2           0             0.8
0.4         0.80        0        0           0             -
0.4         0.85        0        0           0             -

Table 3. Results of the execution of the optimization model (6).
Observe that, since both optimization models minimize the number of bins, the discovered rules encompass binary decisions (i.e., two labels) for all variables, which is (arguably) informative for the specific application domain.

Rule                                                          # of cases  Conf.
d > 0.25 ∧ w > 3.18 ∧ h < 61.54 ∧ t > 2778.70 → NO2 ≤ 79.40   929         0.88
w > 3.18 ∧ t > 2778.70 → NO2 ≤ 79.40                          1596        0.86
w > 3.18 ∧ t > 4403.88 → NO2 ≤ 79.40                          888         0.92
w > 3.18 ∧ h ≤ 61.54 ∧ t > 2778.70 → NO2 ≤ 79.40              832         0.86

Table 4. Some interesting extracted rules.

Comparison with existing methods. A natural question at this point is whether we could have obtained similar results with the existing methods. To answer this question we considered a static discretizer (the class Discretize in WEKA, which behaves identically to the other available static filters in R and in Python). As we have discussed in Section 2, a static discretizer usually offers equal-width and equal-frequency binning. While for normally distributed (i.e., Gaussian-like) variables both settings may return meaningful results, for non-normally distributed ones a further analysis of this aspect is necessary. Consider, for example, the variable NO2 in our data, which is strongly right-skewed. In Figure 2 we compare the results obtained by the three different solutions (equal-width, equal-frequency, and our dynamic discretizer) in the case of two bins; because of the distribution, equal-width binning in this case returns a bin (NO2 ≤ 117.64) which is meaningless: as a matter of fact, all rules that can be extracted with the different settings (support varying from 0.1 to 0.4, confidence varying from 0.7 to 0.85) have that particular bin as consequent, which is uninformative because it is true in almost every instance. Therefore, equal-width binning makes little sense on our data set. Equal-frequency binning returned better results.

Fig. 2. Discretizing NO2 in 2 bins. In red, we show the mean; in yellow, the splitting points chosen by our optimization algorithm; in blue, the splitting points obtained by static equal-frequency binning; in green, the splitting points obtained by static equal-width binning.

For an adequate comparison, for each experimental setup we have chosen the best mask (e.g., [(0.2, 2), (0.4, 2), (1.0, 3), (1.0, 2), (0.7, 5), (0.6, 5), (1.0, 2)]) returned by our algorithm, and asked the static filter to use that particular choice as the number of bins for each variable (observe that deciding the number of bins is itself a problem in static binning; obviously, choosing the best option goes towards the fairest comparison). The results of such an experiment are shown in Table 5: only a few rules emerged, and none of them falls into one of the interesting categories that we have mentioned above.

Min. Supp.  Min. Conf.  # Rules  Av. length  Quality ind.  Av. Conf.
0.1         0.70        19       2.73        7             0.77
0.1         0.75        0        -           -             -
0.1         0.80        5        2           0             0.88
0.1         0.85        6        2.16        0             0.88
0.2         0.70        0        -           -             -
0.2         0.75        0        -           -             -
0.2         0.80        0        -           -             -
0.2         0.85        0        -           -             -
0.3         0.70        0        -           -             -
0.3         0.75        0        -           -             -
0.3         0.80        0        -           -             -
0.3         0.85        0        -           -             -
0.4         0.70        0        -           -             -
0.4         0.75        0        -           -             -
0.4         0.80        0        -           -             -
0.4         0.85        0        -           -             -

Table 5. Results of the execution of the static discretizer.

The most frequent result in all experiments shows a 3-bins splitting for NO2, and Figure 3 gives a hint of the problems that emerge with equal-frequency binning: as one can see, the resulting middle group is too narrow for a meaningful rule to be extracted.

Fig. 3. Discretizing NO2 in 3 bins. In red, we show the mean; in yellow, the splitting points chosen by our optimization algorithm; in blue, the splitting points obtained by static equal-frequency binning; in green, the splitting points obtained by static equal-width binning.
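To see the phenomenon of Figures 2 and 3 outside the specific data set, the following sketch (ours, with synthetic right-skewed data, not the Wroclaw measurements) compares the three-bin cut points produced by equal-width binning, equal-frequency binning, and the distribution-based function (2) with l = 3:

```python
# Three-bin cut points under the three discretization schemes, on a
# synthetic right-skewed sample standing in for a variable like NO2.
import numpy as np

rng = np.random.default_rng(1)
no2 = rng.gamma(shape=2.0, scale=25.0, size=25_000)

lo, hi = no2.min(), no2.max()
width_cuts = [lo + (hi - lo) / 3, lo + 2 * (hi - lo) / 3]  # equal-width
freq_cuts = np.quantile(no2, [1 / 3, 2 / 3])               # equal-frequency
mu, sigma, k = no2.mean(), no2.std(), 0.5
ours_cuts = [mu - k * sigma, mu + k * sigma]               # function (2), l = 3

print("equal-width cuts:    ", np.round(width_cuts, 2))
print("equal-frequency cuts:", np.round(freq_cuts, 2))
print("distribution-based:  ", np.round(ours_cuts, 2))
```

On such a distribution the long right tail drags the equal-width cuts far to the right of the bulk of the observations, while the mean/standard-deviation-based cuts stay centered on it, mirroring the behaviour observed on NO2.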
6 Conclusions

In this paper we have considered a data set that represents the hourly measurement of several air quality parameters at a specific monitoring station in the center of Wroclaw (Poland) over a period of three years. Instead of searching for a general explanation model for these data, we have focused on extracting meaningful local rules w.r.t. the pollutant, which may be useful to model the particular situations that occurred during the measurement period. We proposed and tested a multivariate, non-deterministic, distribution-based discretization algorithm coupled with the well-known rule extraction algorithm APRIORI, and implemented it via the evolutionary algorithm NSGA-II. We experimentally showed that our solution produces more rules and, in some cases, more interesting rules w.r.t. the specific application domain than those produced by the discretization algorithms available in some open-source learning libraries.

References

1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499, 1994.
2. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of the 11th International Conference on Data Engineering, pages 3–14, 1995.
3. R. Agrawal and R. Srikant. Mining quantitative association rules in large relational tables. SIGMOD Rec., 25(2):1–12, 1996.
4. V. Bartík and J. Zendulka. Distance-Based Methods for Association Rules Mining, pages 689–694. IGI Global, 2nd edition, 2008.
5. B. Chlebus and S. Nguyen. On finding optimal discretizations for two attributes. In Proc. of the 1st International Conference on Rough Sets and Current Trends in Computing, pages 537–544, 1998.
6. Y. Collette and P. Siarry. Multiobjective Optimization: Principles and Case Studies. Springer Berlin Heidelberg, 2004.
7. K. Deb. Multi-objective optimization using evolutionary algorithms. Wiley, London, UK, 2001.
8. J. Durillo and A. Nebro. jMetal: a Java framework for multi-objective optimization. Advances in Engineering Software, 42:760–771, 2011.
9. C. Emmanouilidis, A. Hunter, J. Macintyre, and C. Cox. A multi-objective genetic algorithm approach to feature selection in neural and fuzzy modeling. Evolutionary Optimization, 3(1):1–26, 2001.
10. F. Jiménez, G. Sánchez, J. García, G. Sciavicco, and L. Miralles. Multi-objective evolutionary feature selection for online sales forecasting. Neurocomputing, 234:75–92, 2017.
11. J. A. Kamińska. Probabilistic forecasting of nitrogen dioxide concentrations at an urban road intersection. Sustainability, 10(11), 2018.
12. J. A. Kamińska. A random forest partition model for predicting NO2 concentrations from traffic flow and meteorological conditions. Science of the Total Environment, 651:475–483, 2019.
13. B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. of the 13th International Conference on Data Engineering, pages 220–231. IEEE, 1997.
14. B. Minaei-Bidgoli, R. Barmaki, and M. Nasiri. Mining numerical association rules via multi-objective genetic algorithms. Information Sciences, 233:15–24, 2013.
15. M. N. Moreno, S. Segrera, V. F. López, and M. J. Polo. A method for mining quantitative association rules. In Proc. of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, pages 173–178, 2006.
16. A. Mukhopadhyay, U. Maulik, S. Bandyopadhyay, and C. C. Coello. A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Transactions on Evolutionary Computation, 18(1):4–19, 2014.
17. B. Özden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. In Proc. of the 14th International Conference on Data Engineering, pages 412–421, 1998.
18. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
19. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
20. A. Salleb-Aouissi, C. Vrain, C. Nortet, X. Kong, V. Rathod, and D. Cassard. QuantMiner for mining quantitative association rules. Journal of Machine Learning Research, 14(1):3153–3157, 2013.
21. Q. Tong, B. Yan, and Y. Zhou. Mining quantitative association rules on overlapped intervals. In Proc. of the 1st International Conference on Advanced Data Mining and Applications, pages 43–50. Springer-Verlag, 2005.
22. I. Witten, E. Frank, and M. Hall. Data mining: practical machine learning tools and techniques, 3rd Edition. Morgan Kaufmann, Elsevier, 2011.

Appendix: NP-hardness of Optimal Binning

It is well known that optimal preprocessing is a computationally hard problem, and commonly accepted that learning algorithms use sub-optimal (deterministic or heuristic-based) solutions. Yet, precisely pinpointing the formal problem is mathematically interesting per se. Discretization as a classification task has been studied from the theoretical point of view in [5], and shown to be NP-hard; in this section we show how discretization for rule extraction can be reduced to it. Consider a numerical data set A with n features plus one categorical class D (in this context, referred to as a decision):

\[
A = \begin{pmatrix}
a_{11} & a_{12} & \dots & a_{1n} & d_1\\
a_{21} & a_{22} & \dots & a_{2n} & d_2\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{m1} & a_{m2} & \dots & a_{mn} & d_m
\end{pmatrix}
\]

A pair (Ai, c), where Ai is an attribute and c ∈ R, is called a cut. Any set of cuts {(Ai, c1), . . . , (Ai, cl)} implicitly defines the partition {(−∞, c1), [c1, c2), . . . , [cl, +∞)} (denoted {γi1, . . . , γili} in Section 3). Having a partition for each attribute in A allows us to discretize the original data set to obtain a categorical one:

\[
A' = \begin{pmatrix}
\gamma_{11} & \gamma_{12} & \dots & \gamma_{1n} & d_1\\
\gamma_{21} & \gamma_{22} & \dots & \gamma_{2n} & d_2\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\gamma_{m1} & \gamma_{m2} & \dots & \gamma_{mn} & d_m
\end{pmatrix}
\]

Such a discretization is said to be D-consistent if there are no two instances with the same categorical values and a different class; in other words, a D-consistent transformation is a perfect classification. If A can be D-consistently discretized using at most l cuts (in total), we write l → A. Now, let n = 2 and let D be a binary class. One can define the language

\[
L = \{\langle A, l\rangle \mid l \to A\}.
\]

In [5], it has been shown that deciding L is NP-hard. Adapting this result to our case is now easy. Indeed, the problem of finding the perfect discretization for extracting association rules from a numerical data set can be formalized as follows. Let A be a numerical data set with n attributes, each of which we can partition in the same way as before. Each discretization implicitly entails the existence of rules with a certain support and confidence, and, in particular, of rules whose consequent is any handpicked attribute; let us focus on rules with An as consequent. As we know, rules are evaluated by their support and confidence.
Let us denote by l, t →*s,c A the fact that A can be discretized in such a way that no more than l (total) bins are used, of which exactly t are the bins of the attribute An, and such that for each interval γ of the attribute An a group of rules can be extracted with confidence at least c and with the sum of all supports at least s. Observe that this discretization arguably describes what, in Section 3, we have called extracting the highest possible number of meaningful and interpretable rules. The problem of finding the perfect discretization in these terms can then be described as a decision problem, as follows:

\[
\bar{L} = \{\langle A, l, t, s, c\rangle \mid l, t \to^*_{s,c} A\}.
\]

Theorem 1. L̄ is NP-hard.

Proof. The proof is by polynomial reduction, that is, we prove that L ≤p L̄. To this end, let A be a numerical data set with n = 2 and a binary decision D. We produce a numerical data set A′ with n = 3, where A′1 = A1, A′2 = A2, and A′3 has domain {0, 1} (that is, we copy the entire data set column-by-column, interpreting the decision as a numerical attribute). Now, we claim that l → A if and only if l + 2, 2 →*1,1 A′, that is, ⟨A, l⟩ ∈ L if and only if ⟨A′, l + 2, 2, 1, 1⟩ ∈ L̄. Suppose, first, that l + 2, 2 →*1,1 A′. By definition, there exists at least one rule γ1 ∧ . . . ∧ γs ⇒ 0 and at least one rule γ′1 ∧ . . . ∧ γ′s′ ⇒ 1 that hold in (the discretized counterpart of) A′ with minimal confidence 1. Because of the requirement on the minimal total support, every instance is in the support of some rule; because of the requirement on the minimal local confidence, no instance can be in the support of more than one rule. It is now immediate to interpret the set of extracted rules as a rule-based classifier with a perfect score, that is, to see that l → A. Suppose, on the other hand, that l → A holds. Every instance can be seen as a rule; all such rules have confidence 1, because A is D-consistently discretized, and no instance is left without its corresponding rule, which means that the total support is 1, that is, that l + 2, 2 →*1,1 A′, as we wanted. □