=Paper=
{{Paper
|id=Vol-3606/51
|storemode=property
|title=(Un)supervised Univariate Feature Extraction and Selection for Dimensional Data
|pdfUrl=https://ceur-ws.org/Vol-3606/paper51.pdf
|volume=Vol-3606
|authors=Patrik Cavina,Federico Manzella,Giovanni Pagliarini,Guido Sciavicco,Eduard I. Stan
|dblpUrl=https://dblp.org/rec/conf/itadata/CavinaMPSS23
}}
==(Un)supervised Univariate Feature Extraction and Selection for Dimensional Data==
Patrik Cavina1,∗,†, Federico Manzella1,∗,†, Giovanni Pagliarini1,∗,†, Guido Sciavicco1,∗,† and Eduard I. Stan1,2,∗,†
1 ACLAI Laboratory, University of Ferrara (Italy)
2 DBS Group, Free University of Bozen-Bolzano (Italy)

Abstract
Feature selection, defined as the automatic selection of the most relevant features from a machine learning dataset, has three objectives: to improve the prediction performance of the predictors, to provide faster and more cost-effective predictors, and to offer a better understanding of the underlying process that generated the data. In the case of dimensional (e.g., temporal or spatial) data, feature selection is often approached using standard methods that neglect or denature the dimensional component. This paper provides a first step towards systematic and general dimensional feature selection, with a portfolio of supervised and unsupervised, filter-based selectors that can be naturally combined into an end-to-end methodology. In a hypothesis-testing setting, our experiments show that our approach can extract provably relevant features in both the temporal and spatial cases.

Keywords
Feature selection, Filter-based selection, Non-tabular data, Dimensional data

ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
∗ Corresponding author. † These authors contributed equally.
patrik.cavina@edu.unife.it (P. Cavina); federic.manzella@edu.unife.it (F. Manzella); giovanni.pagliarini@unife.it (G. Pagliarini); guido.sciavicco@unife.it (G. Sciavicco); ioneleduard.stan@unibz.it (E. I. Stan)
ORCID: 0009-0003-9085-3815 (P. Cavina); 0000-0002-4944-2163 (F. Manzella); 0000-0002-8403-3250 (G. Pagliarini); 0000-0002-9221-879X (G. Sciavicco); 0000-0001-9260-102X (E. I. Stan)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Data analysis involves inspecting, cleaning, transforming, and conceptualizing data. The goal is simple: uncover valuable insights, foster informed conclusions, and support strategic decision-making. This practice unfolds through multiple layers of a data pre-processing pipeline, including feature extraction and selection. Feature extraction is the process of producing meaningful features that capture the underlying complexity of data, revealing valuable insights that may be hidden in the raw data, thus enhancing its predictive power. Feature selection, on the other hand, identifies and selects the most informative subset of features, with the objective of reducing the dimensionality of the data and eliminating noise and redundancy while preserving as much information as possible. Sometimes, feature engineering and selection can be combined into a sequential process that first creates several new features and then selects the most informative ones from the resulting extended set.

Feature selection methods can be classified into filter, wrapper, embedded and hybrid methods [1]. Filter-based methods are independent of the learning models and base their estimate of feature importance on heuristic or statistical ranking criteria. Filters can be characterized in two orthogonal ways: univariate versus multivariate, and supervised versus unsupervised.
Univariate filters rank each feature independently, and thus disregard any form of inter-feature correlation. Multivariate filters, on the contrary, rank multiple features in batches and can account for interdependencies between features [1]. Supervised filters estimate the relevance of a feature (group) with respect to a target variable, that is, the label, while unsupervised ones select features only according to their own nature and characteristics [2]. Filters can be further characterized depending on the statistical principle on which they are based: typical choices in the unsupervised case include filters based on the variance, entropy, or Laplacian score [3], while supervised ones can be based on correlation/covariance, entropy gain, supervised Laplacian score, hypothesis test(s), or discriminating power [4]. Wrapper methods, unlike filter ones, use a learning model to evaluate the performance of different feature subsets; they iterate a search process until an optimal result, or some stopping condition, is reached, where the search strategies can be random, sequential, or heuristic [5]. Embedded methods integrate the feature selection process into the learning model. Hybrid methods are combinations of filters and wrappers that take advantage of both techniques: the filter reduces the feature set to a good enough subset, and the wrapper maximizes performance [1].

In the era of big data, more-than-tabular data embodies 95% of all existing data [6]. Tabular data is often found in spreadsheets and relational databases, which can be easily analyzed using data analytic software. Non-tabular data, such as sets of time series, images, videos, graphs and texts, is subjective and interpretative, meaning that data analytic tools may not always be readily accessible, which prompts practitioners to develop ad-hoc tools or enhance existing ones. A particular, but very common, case of non-tabular data is 𝑑-dimensional data, in which every instance is represented by an 𝑛-tuple of real functions defined on a 𝑑-dimensional discrete space, covering virtually all cases of real-world temporal (𝑑 = 1), spatial (𝑑 = 2, 𝑑 = 3), and video (𝑑 = 3) data, as well as scalar ones (𝑑 = 0).

Learning from temporal and/or spatial datasets has been primarily accomplished with sub-symbolic techniques, specifically neural networks; the current literature on this topic is too wide to be reviewed here. Recently, however, symbolic learning in both the temporal and the spatial case has started to be systematically explored. Among several examples, for the temporal case we mention [7], for extracting temporal rules, [8], for decision tree-based procedures to classify time series data, [9], in which shapelets are used to classify time series, and [10], in which classifiers designed to extract point-based temporal logic formulas are used to solve the classification problem involving time series. Symbolic methods, and in particular propositional decision trees, have also been applied to spatial data [11].

Recently, symbolic learning from 𝑑-dimensional data has been approached with a rather new technique known as modal symbolic learning. Modal symbolic learning is based on the idea that propositional logic can be replaced by modal logic as a tool to describe 𝑑-dimensional instances, and therefore, used to extract patterns.
In modal symbolic learning, 𝑑-dimensional data are analyzed by considering all 𝑑-dimensional hyperintervals of a dataset as well as their qualitative relationships, and patterns are described with a suitable modal logic with enough expressive power to represent such relationships. Although the theoretical properties of modal symbolic learning techniques have started to be studied only recently [12], modal decision trees and modal random forests have been successfully applied to several different temporal and spatial datasets [13, 14]. As it turns out, modal symbolic learning on 𝑑-dimensional data is naturally associated to feature engineering and selection specifically designed to explore 𝑑-dimensional hyperintervals.

In this work, we focus on designing a univariate filter-based feature extraction and selection technique that includes both unsupervised and supervised methods. Such methods are the 𝑑-dimensional generalization of standard ones, and can be combined into a general protocol to select the most informative combinations of features and variables, whose informative value can be assessed independently of learning models.

2. Dimensional Data

Let a tabular instance, also referred to as an adimensional instance, be an object of the type 𝐼 = (𝜈1, …, 𝜈𝑛), where 𝜈𝑖 ∈ ℝ for each 1 ≤ 𝑖 ≤ 𝑛. Each value 𝜈𝑖 is the value of a variable, whose name is denoted by 𝑉𝑖; in the following, we identify variables with their names. A tabular dataset (or adimensional dataset) is, therefore, a collection ℐ = {𝐼1, …, 𝐼𝑚} of tabular instances whose values are taken from a set of variables 𝒱 = {𝑉1, …, 𝑉𝑛}. Tabular datasets can also be labelled if each instance 𝐼 is associated to a unique label 𝐿 from a set ℒ = {𝐿1, …, 𝐿𝑘}. Tabular datasets are classic, scalar datasets as they are usually defined in data science and machine learning; note that categorical variables can always be transformed into scalar ones via one-hot encoding.

Tabular datasets, however, are a particular case of dimensional ones. A 𝑑-dimensional instance is an 𝑛-tuple of functions 𝐼 = (𝜈1, …, 𝜈𝑛), each defined as 𝜈𝑖 ∶ ℕ^𝑑 → ℝ, and a 𝑑-dimensional dataset is a collection ℐ = {𝐼1, …, 𝐼𝑚} of 𝑑-dimensional instances, each possibly associated to a unique label. The notion of 𝑑-dimensional dataset generalizes tabular datasets (when 𝑑 = 0), as well as datasets of time series (𝑑 = 1), images (𝑑 = 2), and videos (𝑑 = 3), among others. In a 𝑑-dimensional instance too, a variable (function) 𝜈𝑖 has a name 𝑉𝑖. In the following, we restrict ourselves to the case of finite datasets, in which we have exactly 𝑁 distinct points in every dimension.

Information extraction from 𝑑-dimensional data occurs via a predetermined set of feature extraction functions ℱ = {𝑓1, …, 𝑓𝑙}. Each function 𝑓𝑖 is abstractly defined as

𝑓𝑖 ∶ ⋃_{1 ≤ 𝑗1, …, 𝑗𝑑 ≤ 𝑁} ℝ^{𝑗1} × … × ℝ^{𝑗𝑑} → ℝ.

Examples are the generalized mean, maximum and minimum, but ℱ may include more elaborate examples whose concrete definition may depend on the relative order of values; an example in this category, in the case 𝑑 = 1, could be the number of local maxima in a given interval of values. A function 𝑓 is applied to a variable 𝑉 to generate the feature 𝑓(𝑉).

There are many possible logical representations for a 𝑑-dimensional dataset. A very natural one is inspired by both the interval temporal logic ℋ𝒮 [15] and the Rectangle Algebra [16], which, in turn, can be immediately generalized to the case of 𝑑 dimensions.
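Before turning to the logical representation, the following minimal sketch (in Julia, the language of the authors' implementation; all names, data, and the choice of functions are illustrative assumptions, not taken from their code) shows what a 1-dimensional instance and a set of feature extraction functions look like in practice, including the order-dependent "number of local maxima" example mentioned above:

```julia
using Statistics

# A 1-dimensional instance (d = 1) with n = 2 variables, each a function {1,…,N} → ℝ.
N = 60
inst = (sin.(0.3 .* (1:N)), randn(N))     # (ν₁, ν₂), illustrative data only

# Feature extraction functions accept sub-arrays of any size and return a scalar.
# Besides minimum/maximum/mean, an order-dependent example from the text:
# the number of local maxima in a given interval of values.
local_maxima(v) = count(i -> v[i-1] < v[i] > v[i+1], 2:length(v)-1)

funs = Dict("min" => minimum, "max" => maximum, "mean" => mean, "npeaks" => local_maxima)

# The feature f(V), here evaluated on variable V₁ restricted to the interval [10, 40]:
funs["npeaks"](inst[1][10:40])
```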
Let 𝔻 be an initial prefix of ℕ of cardinality 𝑁, and let 𝔻^𝑑 be its corresponding 𝑑-dimensional generalization. Let 𝑃^𝑖_𝑗 be the (𝑑 − 1)-dimensional discrete space passing through the 𝑗-th point of the 𝑖-th component of 𝔻^𝑑 and whose normal is non-null only on that component (e.g., when 𝑑 = 1, 𝑃^1_𝑗 is simply the 𝑗-th point; when 𝑑 = 2, 𝑃^1_𝑗 is the straight line that goes through the 𝑗-th point and is parallel to the vertical axis). A hyperinterval 𝐻 is a set

𝐻 = {(𝑃^1_{𝑙𝑜𝑤1}, 𝑃^1_{ℎ𝑖𝑔ℎ1}), …, (𝑃^𝑑_{𝑙𝑜𝑤𝑑}, 𝑃^𝑑_{ℎ𝑖𝑔ℎ𝑑})},

where 𝑙𝑜𝑤𝑖 ≤ ℎ𝑖𝑔ℎ𝑖 for each 1 ≤ 𝑖 ≤ 𝑑. Thus, this notion simply generalizes the one of interval in one dimension. Describing a hyperinterval is easily achieved via a 𝑑-tuple of pairs of natural numbers such as 𝐻 = [(𝑥1, 𝑦1), …, (𝑥𝑑, 𝑦𝑑)], where the 𝑖-th component corresponds to the 𝑖-th dimension, and, as before, 𝑥𝑖 ≤ 𝑦𝑖 for each 1 ≤ 𝑖 ≤ 𝑑. A notion of a point belonging to a hyperinterval, denoted by (𝑥1, …, 𝑥𝑑) ∈ 𝐻, emerges naturally.

Table 1: Allen's interval relations and HS modalities. Equality (=) is not displayed.
⟨𝐴⟩ (after): [𝑥, 𝑦] 𝑅𝐴 [𝑤, 𝑧] ⇔ 𝑦 = 𝑤
⟨𝐿⟩ (later): [𝑥, 𝑦] 𝑅𝐿 [𝑤, 𝑧] ⇔ 𝑦 < 𝑤
⟨𝐵⟩ (begins): [𝑥, 𝑦] 𝑅𝐵 [𝑤, 𝑧] ⇔ 𝑥 = 𝑤 ∧ 𝑧 < 𝑦
⟨𝐸⟩ (ends): [𝑥, 𝑦] 𝑅𝐸 [𝑤, 𝑧] ⇔ 𝑦 = 𝑧 ∧ 𝑥 < 𝑤
⟨𝐷⟩ (during): [𝑥, 𝑦] 𝑅𝐷 [𝑤, 𝑧] ⇔ 𝑥 < 𝑤 ∧ 𝑧 < 𝑦
⟨𝑂⟩ (overlaps): [𝑥, 𝑦] 𝑅𝑂 [𝑤, 𝑧] ⇔ 𝑥 < 𝑤 < 𝑦 < 𝑧

Given the 13 so-called Allen's relations, which allow one to qualitatively describe the geometric arrangement between any two intervals in a linear order, and whose notation and informal semantics are depicted in Tab. 1, it is immediate to generalize such relations to the case of hyperintervals. Thus, any two hyperintervals 𝐻, 𝐻′ are related to each other by exactly one hyperinterval relation 𝑅_{𝑋1…𝑋𝑑}, where 𝑋𝑖 ∈ 𝒳 = {𝐴, 𝐴̅, 𝐿, 𝐿̅, 𝐵, 𝐵̅, 𝐸, 𝐸̅, 𝐷, 𝐷̅, 𝑂, 𝑂̅, =} for each 1 ≤ 𝑖 ≤ 𝑑 (overlined letters denote inverse relations); such relations generalize the definition of an Allen's relation to 𝑑 dimensions (see Fig. 1).

Given a set 𝒫 of propositional letters, the logic ℋ𝒮^𝑑 is the 𝑑-dimensional generalization of the interval temporal logic ℋ𝒮. Well-formed ℋ𝒮^𝑑 formulas are obtained by the following syntax:

𝜑 ∶∶= 𝑝 ∣ ¬𝜑 ∣ 𝜑 ∨ 𝜑 ∣ ⟨𝑋1 … 𝑋𝑑⟩𝜑,

where, for each 𝑖, 𝑋𝑖 ∈ 𝒳, and ⟨𝑋1 … 𝑋𝑑⟩ ≠ ⟨= … =⟩. The remaining Boolean operators, as well as the universal version of each of the 13^𝑑 − 1 existential operators, can be defined as shortcuts via duality.

[Figure 1: Examples of hyperintervals and ℋ𝒮^𝑑 relations for 𝑑 ∈ {1, 2, 3}.]

The strict semantics of ℋ𝒮^𝑑 is given in terms of 𝑑-dimensional models of the type 𝑀 = ⟨𝕀(𝔻^𝑑), 𝜙⟩, where 𝕀(𝔻^𝑑) is the set of all hyperintervals over 𝔻^𝑑, and 𝜙 ∶ 𝒫 → 2^{𝕀(𝔻^𝑑)} is a valuation function which assigns to every atomic proposition 𝑝 ∈ 𝒫 the set of hyperintervals 𝜙(𝑝) on which 𝑝 holds. The truth of a formula 𝜑 on a given hyperinterval 𝐻 in an interval model 𝑀, denoted by 𝑀, 𝐻 ⊩ 𝜑, is defined by structural induction on the complexity of formulas as follows:

𝑀, 𝐻 ⊩ 𝑝 if and only if 𝐻 ∈ 𝜙(𝑝), for each 𝑝 ∈ 𝒫;
𝑀, 𝐻 ⊩ ¬𝜓 if and only if 𝑀, 𝐻 ⊮ 𝜓;
𝑀, 𝐻 ⊩ 𝜓1 ∨ 𝜓2 if and only if 𝑀, 𝐻 ⊩ 𝜓1 or 𝑀, 𝐻 ⊩ 𝜓2;
𝑀, 𝐻 ⊩ ⟨𝑋1 … 𝑋𝑑⟩𝜓 if and only if there exists 𝐻′ s.t. 𝐻 𝑅_{𝑋1…𝑋𝑑} 𝐻′ and 𝑀, 𝐻′ ⊩ 𝜓.

The notion of formula satisfied by a model, satisfiable formula, and valid formula are defined in the standard way.
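The component-wise nature of hyperinterval relations is easy to mimic in code. The following is a hedged sketch (Julia, illustrative names only) of the Allen relation between two strict intervals, following the definitions of Table 1, and of its lifting to a 𝑑-tuple of per-axis relations:

```julia
# Per-axis Allen-style relations between strict intervals [x, y] and [w, z]
# (assuming x < y and w < z); :inv_X denotes the inverse of relation X.
function allen(x, y, w, z)
    x == w && y == z && return :eq
    y == w           && return :A   # after
    y < w            && return :L   # later
    x == w && z < y  && return :B   # begins
    y == z && x < w  && return :E   # ends
    x < w && z < y   && return :D   # during
    x < w < y < z    && return :O   # overlaps
    return Symbol(:inv_, allen(w, z, x, y))  # otherwise, an inverse relation holds
end

# A hyperinterval relation R_{X1…Xd} is simply the d-tuple of per-axis relations.
hs_relation(H1, H2) = Tuple(allen(a..., b...) for (a, b) in zip(H1, H2))

hs_relation([(1, 3)], [(3, 6)])                  # (:A,)      -- d = 1
hs_relation([(1, 5), (2, 6)], [(3, 7), (4, 5)])  # (:O, :D)   -- d = 2
```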
Quite naturally, a 𝑑-dimensional instance described by the variables in 𝒱 can be seen as a 𝑑-dimensional model, provided that, fixed a set of feature extraction functions ℱ, the propositional vocabulary is defined as

𝒫 = {𝑓(𝑉) ⋈ 𝑣 ∣ 𝑓 ∈ ℱ, 𝑉 ∈ 𝒱, ⋈ ∈ {<, ≤, =, ≠, ≥, >}, 𝑣 ∈ ℝ},

so that, given a 𝑑-dimensional instance 𝐼, a hyperinterval 𝐻 = [(𝑥1, 𝑦1), …, (𝑥𝑑, 𝑦𝑑)], and a variable 𝑉, we can define the value of 𝐼 on 𝑉 in 𝐻, denoted 𝐼(𝐻, 𝑉), as a hypermatrix in ℝ^{𝑦1−𝑥1+1} × … × ℝ^{𝑦𝑑−𝑥𝑑+1} whose generic element with indexes 𝑗1, …, 𝑗𝑑, with 1 ≤ 𝑗𝑖 ≤ 𝑦𝑖 − 𝑥𝑖 + 1 for each 1 ≤ 𝑖 ≤ 𝑑, is defined as:

𝐼(𝐻, 𝑉)_{𝑗1,…,𝑗𝑑} = 𝜈(𝑥1 + 𝑗1 − 1, …, 𝑥𝑑 + 𝑗𝑑 − 1).

Then, for a given propositional letter 𝑝 = 𝑓(𝑉) ⋈ 𝑣, we define

𝜙(𝑓(𝑉) ⋈ 𝑣) = {𝐻 ∣ 𝐻 ∈ 𝕀(𝔻^𝑑) and 𝑓(𝐼(𝐻, 𝑉)) ⋈ 𝑣}.

Modal decision trees and modal random forests, when applied to 𝑑-dimensional datasets, may extract patterns written in ℋ𝒮^𝑑 [13, 14]. Therefore, establishing which variables and feature extraction functions, that is, which features, are informative in which hyperintervals, is a natural pre-processing step. Typical 𝑑-dimensional datasets are described by hundreds or thousands of variables in several dimensions, to which tens of feature extraction functions can be applied; this leads to a number of features in the order of the tens of thousands, from which a choice of a few units is usually made for further analysis.

[Figure 2: Schematic representation of the proposed pipeline and its parametrization. Stages: DATA ℐ → HYPERWINDOWING → ℐ^𝑤 → FEATURE EXTRACTION → ℐ′ → UNSUPERVISED SELECTION → ℐ″ → SUPERVISED SELECTION → ℐ‴. Parameters: {𝑟1, …, 𝑟𝑅} (hyperwindowing), ℱ = {𝑓1, …, 𝑓𝑙} (feature extraction), 𝑈 (unsupervised algorithm), 𝑈𝑎𝑔𝑔𝑟 (unsupervised aggregation), 𝐸𝑈 (unsupervised selected fraction), 𝑆 (supervised algorithm), 𝑆𝑎𝑔𝑔𝑟 (supervised aggregation), 𝐸𝑆 (supervised selected fraction).]

3. Dimensional Feature Selection

A 𝑑-dimensional dataset in which every instance is described by 𝑁 distinct points on each of the 𝑑 dimensions requires considering (𝑁(𝑁 + 1)/2)^𝑑 distinct hyperintervals; on each one of them, one should assess the value of 𝑛 distinct variables under 𝑙 distinct functions. A few considerations can be immediately drawn: (i) even in the context of filters, exploring all combinations seems infeasible, and (ii) hypothesis test-based selection should be avoided altogether, considering its probabilistic nature (that is, considering that performing too many probabilistic tests on the same data may lead to an unreliable result [17, 18]). However, the potential predictive power of features in 𝑑-dimensional data should be evaluated on hyperintervals, so that symbolic learning methods can take advantage of them.

As a tradeoff, let us introduce the notion of hyperwindow. Let 𝔻^𝑑 be a finite space defined as before, and let 𝐻1, …, 𝐻𝑟 be a set of hyperintervals such that, for every point (𝑥1, …, 𝑥𝑑) ∈ 𝔻^𝑑, there is a hyperinterval 𝐻𝑖 (1 ≤ 𝑖 ≤ 𝑟) with (𝑥1, …, 𝑥𝑑) ∈ 𝐻𝑖; we call each such hyperinterval a hyperwindow, and the set 𝐻1, …, 𝐻𝑟 a hyperwindowing of 𝔻^𝑑. By extracting all features on all hyperwindows for a given 𝑟 and evaluating them, one is able to approximate the behaviour of such features on a generic hyperinterval; there are, of course, several ways in which a hyperwindowing can be produced. Based on the idea that a feature can be evaluated on a hyperwindow, unsupervised and supervised methods in the whole range of classic techniques can be systematically used to this end.
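Since the construction of a hyperwindowing is deliberately left open, the sketch below shows only one simple choice (an assumption, not necessarily the scheme used in the experiments): split each of the 𝑑 axes into 𝑟 contiguous chunks, obtaining 𝑟^𝑑 hyperwindows that jointly cover the grid. Names are illustrative Julia identifiers.

```julia
# One possible hyperwindowing scheme (illustrative assumption): split each of the
# d axes into r contiguous chunks, yielding r^d covering hyperwindows.
function hyperwindowing(N::Int, d::Int, r::Int)
    b = round.(Int, range(0, N; length = r + 1))   # breakpoints along one axis
    axis = [(b[i] + 1, b[i+1]) for i in 1:r]       # r per-axis intervals (lo, hi)
    # one hyperwindow = a d-tuple of per-axis intervals (Cartesian product)
    return vec(collect(Iterators.product(ntuple(_ -> axis, d)...)))
end

hyperwindowing(100, 1, 3)   # 3 windows covering 1:100
hyperwindowing(5, 2, 2)     # 4 windows covering a 5 × 5 image
```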
Given a 𝑑-dimensional dataset ℐ, with 𝑛 dimensional variables to which 𝑙 feature extraction functions can be applied, and given a set of 𝑅 hyperwindowing parameterizations {𝑟1, …, 𝑟𝑅}, a high-level description of the entire approach can be given in the following steps (a minimal code sketch is given at the end of this section):

1. A 𝑑-dimensional dataset ℐ^𝑤 is built by hyperwindowing the original one; ℐ^𝑤 encompasses Σ_{𝑖=1}^{𝑅} 𝑟𝑖 𝑛 dimensional variables.

2. A tabular dataset is derived from ℐ^𝑤, with Σ_{𝑖=1}^{𝑅} 𝑟𝑖 𝑛𝑙 adimensional features. Each feature represents the result of applying a feature extraction function to a variable in a particular hyperwindow, to which a min-max normalization is applied to allow comparisons. This can be done in several ways: variable-wise, function-wise, function and window-wise, or across any combination of variables, depending on the nature of the dataset. Let ℐ′ be the result of this phase.

3. A univariate unsupervised filter 𝑈 is applied to ℐ′. Since values are normalized, an unsupervised individual score (𝑈𝑠𝑐𝑜𝑟𝑒) can be computed for each variable-function-hyperwindow triple. Selecting a pre-determined fraction of triples based on their 𝑈𝑠𝑐𝑜𝑟𝑒 produces a new tabular dataset ℐ″.

4. A univariate supervised filter 𝑆 is then applied to ℐ″. As before, each triple is assigned a supervised individual score (𝑆𝑠𝑐𝑜𝑟𝑒). Selecting a pre-determined fraction of triples based on 𝑆𝑠𝑐𝑜𝑟𝑒 produces a new tabular dataset ℐ‴.

It should be noted that the unsupervised and the supervised steps are actually independent of each other; however, practice suggests that the unsupervised step should be applied first, to eliminate the triples that most probably do not contain any relevant information, and then the supervised step (usually computationally more expensive) should be applied to select the triples that most probably contain information that is relevant to the problem. Furthermore, the above schema can be further extended by adding suitable aggregation functions: both unsupervised and supervised selection can indeed be performed after aggregating the triples either by feature extraction function, variable, hyperwindow, or a combination of them; in fact, it is convenient to assume that an aggregation function is always applied, where the simplest choice is the identity aggregation function with no effect. Other examples of aggregation functions include computing the maximum score or the average score across hyperwindows or across functions. Aggregation can be applied to the end result as well, in order to return the best variables, the best functions, and the best hyperwindows.

Summarizing, the following parameters should be set for a given set of experiments: the set of feature extraction functions (ℱ), the hyperwindowing parametrization ({𝑟1, …, 𝑟𝑅}), the unsupervised filter (𝑈), the supervised filter (𝑆), and two aggregation functions (𝑈𝑎𝑔𝑔𝑟, 𝑆𝑎𝑔𝑔𝑟), along with the fraction of selected elements after the unsupervised step (𝐸𝑈) and the fraction of selected elements after the supervised one (𝐸𝑆). A schematic representation of our approach and its parametrization can be found in Fig. 2.
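The following is a minimal end-to-end sketch of steps 1–4 for 𝑑 = 1 (Julia; all names are illustrative, the windowing is passed in explicitly, and the simple correlation-based supervised score is a stand-in assumption, not the Mutual Information or Fisher Score filters used in the paper):

```julia
using Statistics

# Steps 1-2: extract one scalar per (variable, function, window) triple for every
# instance, then min-max normalize each resulting column.
function tabularize(X::Vector{Matrix{Float64}}, funs, windows)
    m, n = length(X), size(X[1], 2)          # X[i] is N × n (points × variables)
    triples = vec([(v, f, w) for v in 1:n, f in collect(keys(funs)), w in windows])
    T = [funs[f](X[i][w[1]:w[2], v]) for i in 1:m, (v, f, w) in triples]
    lo, hi = minimum(T; dims = 1), maximum(T; dims = 1)
    return (T .- lo) ./ max.(hi .- lo, eps()), triples
end

# Keep the top fraction `frac` of columns according to `score`.
keep_top(score, frac) = sortperm(score; rev = true)[1:ceil(Int, frac * length(score))]

# Step 3: unsupervised, variance-based filter (fraction E_U);
# Step 4: supervised filter (fraction E_S), here |correlation with the label| as a
# simple stand-in for the supervised scores discussed in the paper.
function select_triples(X, y, funs, windows; E_U = 0.5, E_S = 0.1)
    T, triples = tabularize(X, funs, windows)
    keptU = keep_top(vec(var(T; dims = 1)), E_U)
    keptS = keep_top([abs(cor(T[:, j], y)) for j in keptU], E_S)
    return triples[keptU[keptS]]             # selected (variable, function, window) triples
end

# Toy usage: 20 univariate series of length 50; the classes differ in the mean
# of the second half of each series.
X = [hcat(vcat(randn(25), randn(25) .+ (i > 10 ? 2.0 : 0.0))) for i in 1:20]
y = [i > 10 ? 1 : 0 for i in 1:20]
select_triples(X, y, Dict(:mean => mean, :max => maximum), [(1, 25), (26, 50), (1, 50)])
```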
4. Experimental Evaluation

In order to evaluate the effectiveness of our approach we considered two cases, in the temporal (𝑑 = 1) and spatial (𝑑 = 2) setting, respectively. In both cases the datasets are public and have been used in the past for several other learning experiments.

In the temporal case, we use a dataset of audio samples, specifically cough recordings, each labelled as positive or negative to COVID-19 infection [19]. The dataset is composed of 9986 audio samples recorded by 6613 volunteers, of which 235 declared to be positive. This data was originally used to derive smaller datasets, each posing a different form of the same task of COVID-19 diagnosis, depending on the specific set of symptoms and conditions of the subjects. We refer to these tasks and datasets as TA1, TA2, and TA3.

[Figure 3: Summary graph of each feature selection step of the pipeline for each COVID-19 task: a) TA1, b) TA2, c) TA3. Each graph shows the score assigned to every feature sorted in non-ascending order. The blue area highlights the 𝐸𝑈 (unsupervised) and 𝐸𝑆 (supervised) selected features. The scores shown on the y axis are 𝑉𝑎𝑟 for Variance, MI for Mutual Information and FS for Fisher Score.]

Table 2: 𝑝-values of the selected features for COVID-19 (TIME) and LCC (SPACE) tasks. Columns 1–13 index the selected features; a dash indicates that only 12 features were selected.
TIME, TA1, FS: 1.3e-17, 1.3e-17, 1.4e-16, 1.0e-16, 2.0e-16, 3.6e-17, 3.6e-17, 6.6e-15, 6.6e-15, 2.3e-15, 3.5e-16, 3.5e-16, –
TIME, TA1, MI: 4.5e-16, 2.2e-13, 1.9e-10, 5.6e-13, 9.3e-15, 3.4e-11, 3.4e-11, 3.4e-13, 8.1e-17, 1.3e-17, 4.5e-15, 6.6e-15, –
TIME, TA2, FS: 2.6e-9, 4.1e-9, 3.0e-9, 8.2e-9, 2.2e-9, 2.3e-7, 2.3e-7, 1.3e-8, 2.6e-7, 2.6e-7, 1.5e-8, 5.8e-7, –
TIME, TA2, MI: 2.2e-9, 1.5e-5, 3.0e-9, 1.4e-6, 3.6e-6, 2.1e-6, 2.0e-7, 5.9e-9, 2.3e-7, 5.6e-7, 1.6e-6, 5.0e-6, –
TIME, TA3, FS: 0.001, 0.001, 0.001, 0.002, 0.002, 0.002, 0.002, 0.002, 0.002, 0.011, 0.001, 0.002, –
TIME, TA3, MI: 21.074, 0.078, 0.001, 0.002, 0.192, 16.728, 0.036, 0.051, 0.183, 0.019, 0.067, 0.024, –
SPACE, IP, FS: 9.0e-61, 8.9e-61, 8.9e-61, 9.9e-61, 9.0e-61, 8.7e-61, 1.0e-60, 1.0e-60, 1.0e-60, 1.1e-60, 1.1e-60, 1.2e-60, –
SPACE, IP, MI: 1.7e-60, 2.5e-59, 1.5e-59, 5.1e-60, 6.6e-60, 5.3e-60, 4.9e-59, 1.8e-60, 4.4e-52, 6.6e-60, 9.3e-50, 8.5e-51, –
SPACE, PC, FS: 3.4e-56, 1.8e-56, 3.0e-56, 2.2e-56, 2.7e-56, 3.7e-56, 2.8e-56, 2.8e-56, 4.9e-56, 3.0e-56, 4.2e-56, 5.3e-56, 4.7e-56
SPACE, PC, MI: 3.7e-56, 4.2e-56, 2.7e-56, 3.0e-56, 1.2e-55, 6.2e-56, 3.0e-56, 2.8e-56, 1.2e-55, 1.0e-55, 4.8e-56, 8.6e-56, 3.4e-56
SPACE, PU, FS: 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.3e-58, 8.4e-58, 8.3e-58, 8.4e-58
SPACE, PU, MI: 3.0e-56, 2.3e-57, 9.7e-58, 1.4e-57, 1.2e-57, 1.6e-57, 1.1e-56, 2.9e-56, 9.5e-58, 1.4e-56, 9.2e-58, 1.5e-56, 9.3e-58

[Figure 4: Summary graph of each feature selection step of the pipeline for each LCC task: a) Indian Pines, b) Pavia Centre, c) Pavia University. Each graph shows the score assigned to every feature sorted in non-ascending order. The blue area highlights the 𝐸𝑈 (unsupervised) and 𝐸𝑆 (supervised) selected features. The scores shown on the y axis are 𝑉𝑎𝑟 for Variance, MI for Mutual Information and FS for Fisher Score.]

After the original publication, these tasks were approached and solved in several different ways, among which
modal decision trees and forests [13]. In this paper, we considered the tasks TA1 (𝑚 = 343), TA2 (𝑚 = 71), and TA3 (𝑚 = 55). The temporal interpretation of audio signals emerges after the initial pre-processing via Mel-Frequency Cepstral Coefficients (MFCC) [20]. In short, MFCC consists of, first, separating the raw signal into chunks of small size and applying a Discrete Fourier Transform (DFT) to each of the chunks to produce a spectrogram of the sound at each chunk, second, converting and representing the frequency spectrum in mel scale, and third, convolving a set of 𝑛 triangular band-pass filters across the frequency spectrum, discretizing it into a finite number of frequencies. Ultimately, this produces a multivariate time series representation where the 𝑛 variables describe the power of the different sound frequencies; in this experiment, we fixed 𝑛 = 30, that is, the variables are 𝐹1, …, 𝐹30. For all these tasks we set 𝑅 = {1, 3, 6}, for a total of 7 hyperwindows for each variable. There are several feature extraction functions that can be applied to temporal data. A commonly used portfolio is the so-called Catch22 set [21], encompassing 22 selected functions that have shown the ability to extract relevant information from temporal data in several cases; to this set, we add minimum, maximum, and mean, leading to a set of 𝑙 = 25 feature extraction functions.

As for the spatial case, we considered three multiclass datasets, that is, Indian Pines (IP), Pavia University (PU), and Pavia Centre (PC), commonly used to benchmark methods for a problem known as land cover classification (LCC), which typically refers to the task of classifying pixels in remotely sensed images, associating each pixel with a label that captures the use of the land it depicts (e.g., forest, urban area, or crop field). Despite being an instance of standard image segmentation, this problem is usually dealt with as an image classification task. Images of this kind are usually captured from satellite or aerial sensors, and a single pixel carries the average electromagnetic response of a wide surface area; as such, the classification of a pixel in the image typically depends only on a close neighborhood of pixels around it. Additionally, the imagery is hyperspectral, that is, it encloses a large number of spatial variables (spectral channels), describing the strength of signals at different electromagnetic wavelengths. In the case of IP, we considered 1600 5 × 5 images (𝑚 = 1600) with 𝑛 = 200 (in this case, then, the variables are named 𝐶1, …, 𝐶200), each labelled with one of 16 classes (𝑘 = 16), while in the case of PC we had 𝑛 = 103 and 𝑘 = 9 (𝑚 = 1600, 𝐶1, …, 𝐶103). In all cases we let 𝑅 = {2}, and, in terms of feature extraction functions, we limited our search to the generalized mean, maximum, and minimum, that is, 𝑙 = 3.

For both the temporal and the spatial case we used a variance-based unsupervised filter (𝑉𝑎𝑟). This method is based on the idea that features with too low a variance should be excluded; examples of such filters include the Variance Threshold method in the Scikit-Learn suite [22]. In this step we let 𝑈𝑎𝑔𝑔𝑟 be the identity function (no aggregation was performed), selecting the first half of the total number of the variables (𝐸𝑈 = 0.5). In regard to the supervised selection, we applied two different methods for both datasets: Mutual Information (MI) [23] and Fisher Score (FS) [24].
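As a hedged illustration of the two supervised scores, the sketch below implements Fisher Score and a crude histogram-based Mutual Information estimate in Julia (names are illustrative; in particular, MI for continuous features is usually computed with the Kraskov et al. estimator [23], so the binned estimate here is only a rough stand-in for what the paper actually uses):

```julia
using Statistics

# Fisher Score of a single extracted feature column x against discrete labels y:
# between-class scatter divided by within-class scatter.
function fisher_score(x::AbstractVector, y::AbstractVector)
    μ = mean(x)
    num = sum(count(==(c), y) * (mean(x[y .== c]) - μ)^2 for c in unique(y))
    den = sum(count(==(c), y) * var(x[y .== c]) for c in unique(y))
    return num / max(den, eps())
end

# Crude plug-in Mutual Information estimate between a continuous feature and a
# discrete label, via equal-width binning (a stand-in for the Kraskov estimator).
function mutual_information(x::AbstractVector, y::AbstractVector; bins::Int = 10)
    lo, hi = extrema(x)
    bx = clamp.(floor.(Int, (x .- lo) ./ (hi - lo + eps()) .* bins) .+ 1, 1, bins)
    n, mi = length(x), 0.0
    for b in 1:bins, c in unique(y)
        pxy = count((bx .== b) .& (y .== c)) / n
        pxy == 0 && continue
        mi += pxy * log(pxy / ((count(==(b), bx) / n) * (count(==(c), y) / n)))
    end
    return mi
end

x = [randn(50); randn(50) .+ 2.0]
y = [zeros(Int, 50); ones(Int, 50)]
fisher_score(x, y), mutual_information(x, y)
```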
Depending on the problem and task we used different values for 𝐸𝑆: for all temporal dataset tasks we set 𝐸𝑆 = 0.032 (so that the total number of selected features is 12); for the spatial datasets we used 𝐸𝑆 = 0.02 for IP (the final selection resulting in 12 features) and 𝐸𝑆 = 0.04 for both PC and PU (13 features). All results are summarized in graphical form in Fig. 3 and 4.

In both cases the scores in the unsupervised step decrease quite steeply, well before reaching the first half of the set of analyzed features. This indicates that excluding as much as 50% of the features is quite effective, and that this fraction could be increased. After the second step, the value of the selected features can only be estimated: one way to do so is to use each feature in a pairwise, non-parametric hypothesis test (validation step). In the temporal case, the test is binary; in the spatial, multiclass case, instead, we performed one-versus-all tests to identify the class that is best distinguished from the other ones, and its corresponding 𝑝-value. It is crucial to observe that a hypothesis test could not be used in any preceding step as a selection criterion: given its probabilistic nature, performing too many tests renders its results unreliable [18]. This problem still exists during validation, though to a lesser extent; to compensate for it we applied the Bonferroni method for 𝑝-value correction. The corrected 𝑝-values are shown in Tab. 2. As can be seen, most of the selected features, but not all, passed the test with confidence greater than 95%, and in most of those cases greater than 99%. The fact that not all the selected features are validated indicates that the choice is not trivial, and our approach was able to select a few very informative variable-function-hyperwindow triples from sets of between 600 and 7500. By way of example, the first selection for TA1 in the case of FS corresponds to the total power in the lowest fifth of frequencies in the Fourier power spectrum (from Catch22) applied to 𝐹2 on the whole series, while for IP it corresponds to the mean value of 𝐶3 applied to the innermost 3 × 3 window.

5. Conclusions

Feature engineering and selection is a very well-known problem. Most, if not all, of the existing work, however, has been confined to the case of adimensional, tabular data. In this paper, we approached feature engineering and selection for dimensional data. Our proposal is inspired by recent results regarding symbolic learning from dimensional data, but it can be applied independently of any learning method as a way to study dimensional datasets. Conceived as a combination of known techniques, our framework is capable of identifying the most informative variables, the most informative feature extraction functions, and the most informative hyperwindows in a dimensional dataset of any number of dimensions. We performed a series of experiments in the temporal and spatial cases, obtaining quite encouraging results. Our implementation, written in Julia, is open-source and is part of a bigger project that includes symbolic learning techniques, feature selection, and post-hoc interpretation of symbolic models for non-tabular data.

Acknowledgments

We acknowledge the support of the project Modal Geometric Symbolic Learning, funded by the University of Ferrara under the FIRD program.

References

[1] H. M. Abdulwahab, S. Ajitha, M. A. N. Saif, Feature selection techniques in the context of big data: taxonomy and analysis, Applied Intelligence 52 (2022) 13568–13613.
[2] J. Cai, J. Luo, S. Wang, S. Yang, Feature selection in machine learning: A new perspective, Neurocomputing 300 (2018) 70–79.
[3] S. Solorio-Fernández, J. Carrasco-Ochoa, J. Martínez-Trinidad, A systematic evaluation of filter unsupervised feature selection methods, Expert Systems with Applications 162 (2020) 1–26.
[4] S. Huang, Supervised feature selection: A tutorial, Artificial Intelligence Research 4 (2015) 22–37.
[5] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (2014) 16–28.
[6] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics, International Journal of Information Management 35 (2015) 137–144.
[7] J. Rodríguez Diez, C. González, H. Boström, Boosting interval based literals, Intelligent Data Analysis 5 (2001) 245–262.
[8] Y. Yamada, E. Suzuki, H. Yokoi, K. Takabayashi, Decision-Tree Induction from Time-Series Data Based on a Standard-Example Split Test, in: Proceedings of the 20th International Conference on Machine Learning (ICML), 2003, pp. 840–847.
[9] A. Brunello, E. Marzano, A. Montanari, G. Sciavicco, J48SS: A novel decision tree approach for the handling of sequential and time series data, Computers 8 (2019) 21.
[10] S. Jha, A. Tiwari, S. A. Seshia, T. Sahai, N. Shankar, TeLEx: learning signal temporal logic from positive examples using tightness metric, Formal Methods in System Design 54 (2019) 364–387.
[11] X. Li, C. Claramunt, A spatial entropy-based decision tree for classification of geographical information, Transactions in GIS 10 (2006) 451–467.
[12] D. Della Monica, G. Pagliarini, G. Sciavicco, I. Stan, Decision trees with a modal flavor, in: Proc. of the 21st Conference of the Italian Association for Artificial Intelligence (AIxIA), volume 13796 of LNCS, Springer, 2022, pp. 47–59.
[13] F. Manzella, G. Pagliarini, G. Sciavicco, I. Stan, The voice of COVID-19: breath and cough recording classification with temporal decision trees and random forests, Artificial Intelligence in Medicine 137 (2023) 1–14.
[14] G. Pagliarini, G. Sciavicco, Decision tree learning with spatial modal logics, in: Proc. of the 12th International Symposium on Games, Automata, Logics, and Formal Verification (GandALF), volume 346 of EPTCS, 2021, pp. 273–290.
[15] J. Halpern, Y. Shoham, A Propositional Modal Logic of Time Intervals, Journal of the ACM 38 (1991) 935–962.
[16] P. Balbiani, J. Condotta, L. Fariñas del Cerro, A model for reasoning about bidimensional temporal relations, in: Proc. of the 6th International Conference on Principles of Knowledge Representation and Reasoning (KR'98), Morgan Kaufmann, 1998, pp. 124–130.
[17] G. Heinze, D. Dunkler, Five myths about variable selection, Transplant International (2017) 6–10.
[18] M. Jafari, N. Ansari-Pour, Why, when and how to adjust your p values?, Cell Journal 20 (2019) 604–607.
[19] C. Brown, J. Chauhan, A. Grammenos, J. Han, A. Hasthanasombat, D. Spathis, T. Xia, P. Cicuta, C. Mascolo, Exploring automatic diagnosis of COVID-19 from crowdsourced respiratory sound data, in: Proc. of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020, pp. 3474–3484.
[20] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech and Signal Processing 28 (1980) 357–366.
[21] C. Lubba, S. Sethi, P. Knaute, S. Schultz, B. Fulcher, N. Jones, Catch22: Canonical time-series characteristics - selected through highly comparative time-series analysis, Data Mining and Knowledge Discovery 33 (2019) 1821–1852.
[22] F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[23] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Physical Review E 69 (2004) 066138.
[24] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2012.