                                (Un)supervised Univariate Feature Extraction and
                                Selection for Dimensional Data
                                Patrik Cavina1,∗,† , Federico Manzella1,∗,† , Giovanni Pagliarini1,∗,† , Guido Sciavicco1,∗,†
                                and Eduard I. Stan1,2,∗,†
                                1
                                    ACLAI Laboratory, University of Ferrara (Italy)
                                2
                                    DBS Group, Free University of Bozen-Bolzano (Italy)


                                                                         Abstract
                                                                         Feature selection, defined as the automatic selection of the most relevant features from a machine learning
                                                                         dataset, has three objectives: to improve the prediction performance of the predictors, to provide faster
                                                                         and more cost-effective predictors, and to offer a better understanding of the underlying process that
                                                                         generated the data. In the case of dimensional (e.g., temporal or spatial) data, feature selection is often
                                                                         approached using standard methods that neglect or denature the dimensional component. This paper
                                                                         provides a first step towards systematic and general dimensional feature selection, with a portfolio of
                                                                         supervised and unsupervised, filter-based selectors that can be naturally combined into an end-to-end
                                                                         methodology. In a hypothesis-testing setting, our experiments show that our approach can extract
                                                                         provably relevant features in both the temporal and spatial cases.

                                                                         Keywords
                                                                         Feature selection, Filter-based selection, Non-tabular data, Dimensional data




                                1. Introduction
                                Data analysis involves inspecting, cleaning, transforming, and conceptualizing data. The
goal is simple: uncover valuable insights, foster informed conclusions, and support strategic
decision-making. This practice unfolds through multiple layers of a data pre-processing pipeline,
                                including feature extraction and selection. Feature extraction is the process of producing mean-
                                ingful features that capture the underlying complexity of data, revealing valuable insights that
                                may be hidden in the raw data by enhancing its predictive power. Feature selection, on the
                                other hand, identifies and selects the most informative subset of features, with the objective
                                of reducing the dimensionality of the data and eliminating noise and redundancy while pre-
                                serving as much information as possible. Sometimes, feature engineering and selection can be
                                combined into a sequential process that first creates several new features and then selects the
                                most informative ones from the resulting extended set.

                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
                                ∗
                                    Corresponding author.
                                †
                                    These authors contributed equally.
                                Envelope-Open patrik.cavina@edu.unife.it (P. Cavina); federic.manzella@edu.unife.it (F. Manzella); giovanni.pagliarini@unife.it
                                (G. Pagliarini); guido.sciavicco@unife.it (G. Sciavicco); ioneleduard.stan@unibz.it (E. I. Stan)
                                Orcid 0009-0003-9085-3815 (P. Cavina); 0000-0002-4944-2163 (F. Manzella); 0000-0002-8403-3250 (G. Pagliarini);
                                0000-0002-9221-879X (G. Sciavicco); 0000-0001-9260-102X (E. I. Stan)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073




   Feature selection methods can be classified into filter, wrapper, embedded and hybrid meth-
ods [1]. Filter-based methods are independent of the learning models and base their estimate
of feature importance on heuristic or statistical ranking criteria. Filters can be characterized
in two orthogonal ways: univariate versus multivariate, and supervised versus unsupervised.
Univariate filters rank each feature independently, and thus disregard any form of inter-feature
correlation. Multivariate filters, on the contrary, rank multiple features in batches and can
account for interdependencies between features [1]. Supervised filters estimate the relevance of
a feature (group) with respect to a target variable, that is, the label, while unsupervised ones
select features only according to their own nature and characteristics [2]. Filters can be further
characterized depending on the statistical principle on which they are based: typical choices in
the unsupervised case include filters based on the variance, entropy, or Laplacian score [3], while
supervised ones can be based on correlation/covariance, entropy gain, supervised Laplacian score,
hypothesis test(s), or discriminating power [4]. Wrapper methods, unlike filter ones, use a learning
model to evaluate the performance of different feature subsets; they iterate a search process
until an optimal result, or some stopping condition, is reached, where the search strategies
can be random, sequential, or heuristic [5]. Embedded methods integrate the feature selection
process into the learning model. Hybrid methods are combinations of filters and wrappers that
take advantage of both techniques: the filter reduces the feature set to a good enough subset,
and the wrapper maximizes performance [1].
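As an illustration of the univariate filter family described above, the following sketch implements two standard criteria on a small tabular dataset: variance (unsupervised) and the Fisher score (supervised). This is a minimal self-contained example, not code from the paper; all function names are ours.

```python
from statistics import mean, variance

def variance_scores(X):
    """Unsupervised univariate filter: score each column by its sample variance."""
    n_cols = len(X[0])
    return [variance([row[j] for row in X]) for j in range(n_cols)]

def fisher_scores(X, y):
    """Supervised univariate filter: Fisher score of each column w.r.t. labels y.

    Ratio of between-class scatter to within-class scatter; returns 0.0 when the
    within-class scatter vanishes (degenerate case, handled here for simplicity)."""
    labels = sorted(set(y))
    n_cols = len(X[0])
    scores = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        mu = mean(col)
        num = den = 0.0
        for lab in labels:
            group = [v for v, g in zip(col, y) if g == lab]
            num += len(group) * (mean(group) - mu) ** 2
            den += len(group) * variance(group) if len(group) > 1 else 0.0
        scores.append(num / den if den > 0 else 0.0)
    return scores
```

Ranking columns by either score and keeping a fixed fraction of the top ones is the basic selection step that the rest of the paper generalizes to the dimensional case.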
   In the era of big data, more-than-tabular data accounts for 95% of all existing data [6]. Tabular
data is often found in spreadsheets and relational databases, which can be easily analyzed using
data analytic software. Non-tabular data, such as sets of time series, images, videos, graphs
and texts, is subjective and interpretative, meaning that data analytic tools may not always be
readily accessible, which prompts practitioners to develop ad-hoc tools or enhance existing ones.
A particular, but very common, case of non-tabular data is 𝑑-dimensional data, in which every
instance is represented by an 𝑛-dimensional array of real functions defined on a 𝑑-dimensional
discrete space, covering virtually all cases of real-world temporal (𝑑 = 1), spatial (𝑑 = 2, 𝑑 = 3),
video (𝑑 = 3) data, as well as scalar ones (𝑑 = 0).
   Learning from temporal and/or spatial datasets has been primarily accomplished with sub-
symbolic techniques, specifically neural networks; the current literature on this topic is too
wide to be reviewed here. Recently, however, symbolic learning in both the temporal and the
spatial case has started to be systematically explored. Among several examples, for the temporal
case we mention [7], for extracting temporal rules, [8], for decision tree-based procedures to
classify time series data, [9], in which shapelets are used to classify time series, and [10], in
which classifiers designed to extract point-based temporal logic formulas are used to solve the
classification problem involving time series. Symbolic methods, and in particular propositional
decision trees, have also been applied to spatial data [11]. Recently, symbolic learning from
𝑑-dimensional data has been approached with a rather new technique known as modal symbolic
learning. Modal symbolic learning is based on the idea that propositional logic can be replaced
by modal logic as a tool to describe 𝑑-dimensional instances, and therefore, used to extract
patterns. In modal symbolic learning, 𝑑-dimensional data are analyzed by considering all 𝑑-
dimensional hyperintervals of a dataset as well as their qualitative relationships, and patterns
are described with a suitable modal logic with enough expressive power to represent such
relationships. Although the theoretical properties of modal symbolic learning techniques have
started to be studied only recently [12], modal decision trees and modal random forests have
been successfully applied to several different temporal and spatial datasets [13, 14]. As it
turns out, modal symbolic learning on 𝑑-dimensional data is naturally associated to feature
engineering and selection specifically designed to explore 𝑑-dimensional hyperintervals. In this
work, we focus on designing a univariate filter-based feature extraction and selection technique,
that includes both unsupervised and supervised methods. Such methods are the 𝑑-dimensional
generalization of standard ones, and can be combined into a general protocol to select the most
informative combinations of features and variables, whose informative value can be assessed
independently of learning models.


2. Dimensional Data
Let a tabular instance, also referred to as an adimensional instance, be an object of the type 𝐼 =
(𝜈1 , … , 𝜈𝑛 ) where 𝜈𝑖 ∈ ℝ, for each 1 ≤ 𝑖 ≤ 𝑛. Each value 𝜈𝑖 is the value for a variable, whose name
is denoted by 𝑉𝑖 ; in the following, we identify variables with their names. A tabular dataset
(or adimensional dataset) is, therefore, a collection ℐ = {𝐼1 , … , 𝐼𝑚 } of tabular instances whose
values are taken from a set of variables 𝒱 = {𝑉1 , … , 𝑉𝑛 }. Tabular datasets can be also labelled if
each instance 𝐼 is associated to a unique label 𝐿 from a set ℒ = {𝐿1 , … , 𝐿𝑘 }.
   Tabular datasets are classic, scalar datasets as they are usually defined in data science and
machine learning; note that categorical variables can always be transformed into scalar ones
via one-hot encoding. Tabular datasets, however, are a particular case of dimensional ones. A
𝑑-dimensional instance is an 𝑛-tuple of functions 𝐼 = (𝜈1 , … , 𝜈𝑛 ), each defined as

                                            𝜈𝑖 ∶ ℕ𝑑 → ℝ,

and a 𝑑-dimensional dataset a collection ℐ = {𝐼1 , … , 𝐼𝑚 } of 𝑑-dimensional instances each possibly
associated to a unique label. The notion of 𝑑-dimensional dataset generalizes tabular datasets
(when 𝑑 = 0), as well as datasets of time series (𝑑 = 1), images (𝑑 = 2), and videos (𝑑 = 3),
among others. In a 𝑑-dimensional instance too, a variable (function) 𝜈𝑖 has a name 𝑉𝑖 . In the
following, we restrict ourselves to the case of finite datasets, in which we have exactly 𝑁 distinct
points in every dimension.
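Under these definitions, a 𝑑-dimensional instance can be represented concretely as an 𝑛-tuple of 𝑑-nested arrays over the finite domain {0, … , 𝑁 − 1}^𝑑. The sketch below, for 𝑑 = 2 and 𝑛 = 2, is purely illustrative; the two variables are invented for the example.

```python
# A 2-dimensional instance (d = 2) with n = 2 variables, each an N x N grid.
# Each variable nu_i maps a point (p1, p2) of {0, ..., N-1}^2 to a real value.
N = 4

def make_instance():
    nu_1 = [[float(i + j) for j in range(N)] for i in range(N)]  # a gradient "image"
    nu_2 = [[float(i * j) for j in range(N)] for i in range(N)]
    return (nu_1, nu_2)

I = make_instance()
# Accessing variable V_1 of instance I at point (2, 3):
value = I[0][2][3]
```

For 𝑑 = 0 the nesting disappears and the instance degenerates into the tabular tuple of scalars defined above; for 𝑑 = 1 each variable is a plain list, i.e. a time series.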
   Information extraction from 𝑑-dimensional data occurs via using a predetermined set of
feature extraction functions ℱ = {𝑓1 , … , 𝑓𝑙 }. Each function 𝑓𝑖 is abstractly defined as

                                     𝑓𝑖 ∶ ⋃_{1≤𝑗1 ,…,𝑗𝑑 ≤𝑁} ℝ^{𝑗1} × … × ℝ^{𝑗𝑑} → ℝ.

Examples are the generalized mean, maximum and minimum, but ℱ may include more elaborate
examples whose concrete definition may depend on the relative order of values; an example
in this category in the case 𝑑 = 1 could be the number of local maxima in a given interval of
values. A function 𝑓 is applied to a variable 𝑉 to generate the feature 𝑓 (𝑉 ).
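In the case 𝑑 = 1, such feature extraction functions can be sketched as follows: the mean and maximum are the standard examples mentioned above, while counting local maxima is an order-dependent one of the kind just discussed (all names are ours, for illustration only).

```python
def f_mean(values):
    """Generalized mean over an interval of values."""
    return sum(values) / len(values)

def f_max(values):
    """Maximum over an interval of values."""
    return max(values)

def n_local_maxima(values):
    """Order-dependent feature (d = 1): count strict local maxima."""
    return sum(1 for i in range(1, len(values) - 1)
               if values[i - 1] < values[i] > values[i + 1])

V = [1.0, 3.0, 2.0, 5.0, 4.0, 4.5]   # a variable restricted to one interval
features = {"mean(V)": f_mean(V),
            "max(V)": f_max(V),
            "nmax(V)": n_local_maxima(V)}
```

Each entry of `features` is one feature 𝑓 (𝑉 ) in the sense of the text, i.e. one function applied to one variable.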
   There are many possible logical representations for a 𝑑-dimensional dataset. A very natural
one is inspired by both the interval temporal logic ℋ 𝒮 [15] and the Rectangle Algebra [16],
which, in turn, can be immediately generalized to the case of 𝑑-dimensions. Let 𝔻 be an initial
prefix of ℕ of cardinality 𝑁, and let 𝔻𝑑 be its corresponding 𝑑-dimensional generalization. Let 𝑃𝑗𝑖 be
      HS modality        Definition w.r.t. the interval structure
      ⟨𝐴⟩ (after)        [𝑥, 𝑦] 𝑅𝐴 [𝑤, 𝑧]   ⇔   𝑦 = 𝑤
      ⟨𝐿⟩ (later)        [𝑥, 𝑦] 𝑅𝐿 [𝑤, 𝑧]   ⇔   𝑦 < 𝑤
      ⟨𝐵⟩ (begins)       [𝑥, 𝑦] 𝑅𝐵 [𝑤, 𝑧]   ⇔   𝑥 = 𝑤 ∧ 𝑧 < 𝑦
      ⟨𝐸⟩ (ends)         [𝑥, 𝑦] 𝑅𝐸 [𝑤, 𝑧]   ⇔   𝑦 = 𝑧 ∧ 𝑥 < 𝑤
      ⟨𝐷⟩ (during)       [𝑥, 𝑦] 𝑅𝐷 [𝑤, 𝑧]   ⇔   𝑥 < 𝑤 ∧ 𝑧 < 𝑦
      ⟨𝑂⟩ (overlaps)     [𝑥, 𝑦] 𝑅𝑂 [𝑤, 𝑧]   ⇔   𝑥 < 𝑤 < 𝑦 < 𝑧

Table 1
Allen’s interval relations and HS modalities. Equality (=) is not displayed.


the (𝑑 − 1)-dimensional discrete space passing through the 𝑗-th point of the 𝑖-th component of
𝔻𝑑 and whose normal is non-null only on that component (e.g., when 𝑑 = 1, 𝑃𝑗1 is simply the
𝑗-th point; when 𝑑 = 2, 𝑃𝑗1 is the straight line that goes through the 𝑗-th point and is parallel to the
vertical axis). A hyperinterval 𝐻 is a set
                                 𝐻 = {(𝑃^1_{𝑙𝑜𝑤_1} , 𝑃^1_{ℎ𝑖𝑔ℎ_1} ), … , (𝑃^𝑑_{𝑙𝑜𝑤_𝑑} , 𝑃^𝑑_{ℎ𝑖𝑔ℎ_𝑑} )}

where 𝑙𝑜𝑤𝑖 ≤ ℎ𝑖𝑔ℎ𝑖 , for each 1 ≤ 𝑖 ≤ 𝑑. Thus, this notion simply generalizes the one of interval in
one dimension. Describing a hyperinterval is easily achieved via a 𝑑-tuple of pairs of natural
numbers such as
                                    𝐻 = [(𝑥1 , 𝑦1 ), … , (𝑥𝑑 , 𝑦𝑑 )],
where the 𝑖-th component corresponds to the 𝑖-th dimension, and, as before, 𝑥𝑖 ≤ 𝑦𝑖 , for each
1 ≤ 𝑖 ≤ 𝑑. A notion of a point belonging to a hyperinterval, denoted by (𝑥1 , … , 𝑥𝑑 ) ∈ 𝐻, emerges
naturally. Given the 13 so-called Allen’s relations that allow one to qualitatively describe
the geometric arrangement between any two intervals in a linear order, whose notation and
informal semantics is depicted in Tab. 1, it is immediate to generalize such relations to the case
of hyperintervals. Thus, given any two hyperintervals 𝐻 , 𝐻 ′ , they are related to each other by
exactly one hyperinterval relation 𝑅𝑋1 …𝑋𝑑 , where 𝑋𝑖 ∈ 𝒳 = {𝐴, 𝐴̄, 𝐿, 𝐿̄, 𝐵, 𝐵̄, 𝐸, 𝐸̄, 𝐷, 𝐷̄, 𝑂, 𝑂̄, =},
for each 1 ≤ 𝑖 ≤ 𝑑; the relations 𝑅𝑋1 𝑋2 …𝑋𝑑 generalize Allen’s relations to 𝑑 dimensions (see
Fig. 1).
   Given a set 𝒫 of propositional letters, the logic ℋ 𝒮 𝑑 is the 𝑑-dimensional generalization
of the interval temporal logic ℋ 𝒮. Well-formed ℋ 𝒮 𝑑 formulas are obtained by the following
syntax:
                                𝜑 ∶∶= 𝑝 ∣ ¬𝜑 ∣ 𝜑 ∨ 𝜑 ∣ ⟨𝑋1 … 𝑋𝑑 ⟩𝜑,
where, for each 𝑖, 𝑋𝑖 ∈ 𝒳, and ⟨𝑋1 … 𝑋𝑑 ⟩ ≠ ⟨= … =⟩. The remaining Boolean operators, as well
as the universal version of each of the 13^𝑑 − 1 existential operators, can be defined as shortcuts
via duality.
   The strict semantics of ℋ 𝒮 𝑑 is given in terms of 𝑑-dimensional models of the type 𝑀 =
⟨𝕀(𝔻𝑑 ), 𝜙⟩, where 𝕀(𝔻𝑑 ) is the set of all hyperintervals over 𝔻𝑑 , and 𝜙 is a valuation function
Figure 1: Examples of hyperintervals and ℋ 𝒮 𝑑 relations for 𝑑 ∈ {1, 2, 3}: 𝐻1 𝑅𝑂 𝐻2 , 𝐻1 𝑅𝑂𝐷 𝐻2 , and 𝐻1 𝑅𝐷𝑂𝑂 𝐻2 .


𝜙 ∶ 𝒫 → 2^{𝕀(𝔻^𝑑 )} which assigns to every atomic proposition 𝑝 ∈ 𝒫 the set of hyperintervals 𝜙(𝑝)
on which 𝑝 holds. The truth of a formula 𝜑 on a given hyperinterval 𝐻 in an interval model
𝑀, denoted by 𝑀, 𝐻 ⊩ 𝜑, is defined by structural induction on the complexity of formulas as
follows:
      𝑀, 𝐻 ⊩ 𝑝                   if and only if    𝐻 ∈ 𝜙(𝑝), for each 𝑝 ∈ 𝒫 ;
      𝑀, 𝐻 ⊩ ¬𝜓                  if and only if    𝑀, 𝐻 ⊮ 𝜓 ;
      𝑀, 𝐻 ⊩ 𝜓1 ∨ 𝜓2             if and only if    𝑀, 𝐻 ⊩ 𝜓1 or 𝑀, 𝐻 ⊩ 𝜓2 ;
      𝑀, 𝐻 ⊩ ⟨𝑋1 … 𝑋𝑑 ⟩𝜓         if and only if    there exists 𝐻 ′ s.t. 𝐻 𝑅𝑋1 …𝑋𝑑 𝐻 ′ and 𝑀, 𝐻 ′ ⊩ 𝜓 .
The notion of formula satisfied by a model, satisfiable formula, and valid formula are defined in
the standard way.
  Quite naturally, a 𝑑-dimensional instance described by the variables in 𝒱 can be seen as a 𝑑-
dimensional model, provided that, fixed a set of feature extraction functions ℱ, the propositional
vocabulary is defined as

                     𝒫 = {𝑓 (𝑉 ) ⋈ 𝑣 ∣ 𝑓 ∈ ℱ , 𝑉 ∈ 𝒱 , ⋈ ∈ {<, ≤, =, ≠, ≥, >}, 𝑣 ∈ ℝ},

so that, given a 𝑑-dimensional instance 𝐼, a hyperinterval 𝐻 = [(𝑥1 , 𝑦1 ), … , (𝑥𝑑 , 𝑦𝑑 )], and a variable
𝑉, we can define the value of 𝐼 on 𝑉 in 𝐻, denoted 𝐼 (𝐻 , 𝑉 ), as a hypermatrix in ℝ^{𝑦1 −𝑥1 +1} × … × ℝ^{𝑦𝑑 −𝑥𝑑 +1}
whose generic element with indexes 𝑗1 , … , 𝑗𝑑 , with 1 ≤ 𝑗𝑖 ≤ 𝑦𝑖 − 𝑥𝑖 + 1, for each 1 ≤ 𝑖 ≤ 𝑑, is defined
as:
                           𝐼 (𝐻 , 𝑉 )𝑗1 ,…,𝑗𝑑 = 𝜈(𝑥1 + 𝑗1 − 1, … , 𝑥𝑑 + 𝑗𝑑 − 1).
Then, for a given propositional letter 𝑝 = 𝑓 (𝑉 ) ⋈ 𝜈, we define

                            𝜙(𝑓 (𝑉 ) ⋈ 𝜈) = {𝐻 ∣ 𝐻 ∈ 𝕀(𝔻𝑑 ) and 𝑓 (𝐼 (𝐻 , 𝑉 )) ⋈ 𝜈}.
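As a sketch of how this valuation can be computed in the case 𝑑 = 1: enumerate all intervals of the finite domain and test the proposition 𝑓 (𝑉 ) ⋈ 𝜈 on each. The helper and parameter names below are ours, for illustration.

```python
import operator

# The comparison operators allowed in propositional letters f(V) ⋈ v.
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">=": operator.ge, ">": operator.gt}

def valuation(variable, f, op, v, N):
    """phi(f(V) op v): all intervals [x, y] of {0, ..., N-1} on which the
    proposition holds (d = 1). `variable` is the series of values of V."""
    return {(x, y)
            for x in range(N) for y in range(x, N)
            if OPS[op](f(variable[x:y + 1]), v)}
```

For 𝑑 > 1 the same enumeration runs over hyperintervals and `f` is applied to the corresponding hypermatrix 𝐼 (𝐻 , 𝑉 ).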
Modal decision trees and modal random forests, when applied to 𝑑-dimensional datasets, may
extract patterns written in ℋ 𝒮 𝑑 [13, 14]. Therefore, establishing which variables and feature
extraction functions, that is, which features, are informative in which hyperintervals, is a natural
pre-processing step. Typical 𝑑-dimensional datasets are described by hundreds or thousands of
variables in several dimensions, to which tens of feature extraction functions can be applied;
this leads to a number of features in the order of the tens of thousands, from which a choice of
a few units is usually made for further analysis.
   DATA (ℐ) → HYPERWINDOWING → ℐ 𝑤 → FEATURE EXTRACTION → ℐ ′ →
   UNSUPERVISED SELECTION → ℐ ″ → SUPERVISED SELECTION → ℐ ‴ → …

      parameter     description
      𝑅             {𝑟1 , … , 𝑟𝑅 }: hyperwindowing
      ℱ             {𝑓1 , … , 𝑓𝑙 }: feature extraction
      𝑈             unsupervised algorithm
      𝑈𝑎𝑔𝑔𝑟         unsupervised aggregation
      𝐸𝑈            unsupervised selected fraction
      𝑆             supervised algorithm
      𝑆𝑎𝑔𝑔𝑟         supervised aggregation
      𝐸𝑆            supervised selected fraction

Figure 2: Schematic representation of the proposed pipeline and its parametrization.


3. Dimensional Feature Selection
A 𝑑-dimensional dataset in which every instance is described by 𝑁 distinct points on each
of the 𝑑 dimensions leads to the need of considering (𝑁 (𝑁 + 1)/2)^𝑑 distinct hyperintervals; on
each one of them, one should assess the value of 𝑛 distinct variables under 𝑙 distinct functions.
A few considerations can be immediately drawn: (i) even in the context of filters, exploring
all combinations seems infeasible, and (ii) hypothesis test-based selection should be avoided
altogether, considering their probabilistic nature (that is, considering that performing too many
probabilistic tests on the same data may lead to an unreliable result [17, 18]). However, the po-
tential predictive power of features in 𝑑-dimensional data should be evaluated on hyperintervals,
so that symbolic learning methods can take advantage of them.
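A quick computation illustrates how fast (𝑁 (𝑁 + 1)/2)^𝑑 grows; the choice 𝑁 = 100 below is arbitrary, for illustration only.

```python
def n_hyperintervals(N, d):
    """Number of distinct hyperintervals of a d-dimensional N-point grid:
    (N(N+1)/2)^d, since each dimension contributes N(N+1)/2 intervals."""
    return (N * (N + 1) // 2) ** d

# Even modest sizes explode: a 100-point time series (d = 1)
# versus a 100 x 100 image (d = 2) versus a short video (d = 3).
counts = {d: n_hyperintervals(100, d) for d in (1, 2, 3)}
```

Exhaustively scoring every feature on every hyperinterval is therefore out of reach already for 𝑑 = 2, which motivates the hyperwindowing tradeoff introduced next.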
   As a tradeoff, let us introduce the notion of hyperwindow. Let 𝔻𝑑 be a finite space defined as
before, and let 𝐻1 , … , 𝐻𝑟 be a set of hyperintervals such that, for every point (𝑥1 , … , 𝑥𝑑 ) ∈ 𝔻𝑑 ,
there is a hyperinterval 𝐻𝑖 (1 ≤ 𝑖 ≤ 𝑟) with (𝑥1 , … , 𝑥𝑑 ) ∈ 𝐻𝑖 ; we call each such hyperinterval a
hyperwindow, and the set 𝐻1 , … , 𝐻𝑟 a hyperwindowing of 𝔻𝑑 . By extracting all features on all
hyperwindows for a given 𝑟 and evaluating them, one is able to approximate the behaviour
of such features on a generic hyperinterval; there are, of course, several ways in which a
hyperwindowing can be produced.
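One simple way to produce a hyperwindowing, sketched below for illustration, is to split each dimension into 𝑟 contiguous intervals and take all 𝑟^𝑑 resulting grid cells; every point of 𝔻𝑑 is then covered by exactly one hyperwindow. The function names are ours.

```python
from itertools import product

def windows_1d(N, r):
    """Split {0, ..., N-1} into r contiguous intervals covering every point."""
    bounds = [round(i * N / r) for i in range(r + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(r)]

def hyperwindowing(N, r, d):
    """A hyperwindowing of D^d: the r^d hyperrectangles of an
    r-per-dimension grid, each a list of d per-dimension intervals."""
    return [list(combo) for combo in product(windows_1d(N, r), repeat=d)]
```

Other hyperwindowings (overlapping windows, non-uniform splits) satisfy the covering condition just as well; the grid version is merely the simplest choice.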
   Based on the idea that a feature can be evaluated on a hyperwindow, unsupervised and
supervised methods in the whole range of classic techniques can be systematically used to this
end. Given a 𝑑-dimensional dataset ℐ, with 𝑛 dimensional variables to which 𝑙 feature extraction
functions can be applied, and given a set of 𝑅 hyperwindow parameterizations {𝑟1 , … , 𝑟𝑅 }, a
high-level description of the entire approach can be given in the following steps:
    1. A 𝑑-dimensional dataset ℐ 𝑤 is built by hyperwindowing the original one; ℐ 𝑤 encom-
       passes ∑_{𝑖=1}^{𝑅} 𝑟𝑖 𝑛 dimensional variables.
    2. A tabular dataset is derived from ℐ 𝑤 , with ∑_{𝑖=1}^{𝑅} 𝑟𝑖 𝑛𝑙 adimensional features. Each feature
       represents the result of applying a feature extraction function to a variable in a particular
       hyperwindow, to which a min-max normalization is applied, to allow comparisons. This
       can be done in several ways: variable-wise, function-wise, function and window-wise, or
       across any combination of variables, depending on the nature of the dataset. Let ℐ ′ be
       the result of this phase.
    3. A univariate unsupervised filter 𝑈 is applied to ℐ ′ . Since values are normalized, an
       unsupervised individual score (𝑈𝑠𝑐𝑜𝑟𝑒 ) can be computed for each variable-function-
       hyperwindow triple. Selecting a pre-determined fraction of triples based on their 𝑈𝑠𝑐𝑜𝑟𝑒
       produces a new tabular dataset ℐ ″ .
    4. A univariate supervised filter 𝑆 is then applied to ℐ ″ . As before, each remaining triple
       is assigned a supervised individual score (𝑆𝑠𝑐𝑜𝑟𝑒 ). Selecting a pre-determined fraction
       of triples based on 𝑆𝑠𝑐𝑜𝑟𝑒 produces a new tabular dataset ℐ ‴ .
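The four steps can be sketched end to end for 𝑑 = 1. This is an illustrative reading of the pipeline, not the authors' implementation: variance plays the role of 𝑈, a simple difference of per-class means stands in for a supervised score 𝑆 such as the Fisher score, and `e_u`, `e_s` are the selected fractions 𝐸𝑈 and 𝐸𝑆 (binary labels 0/1 are assumed for brevity).

```python
from statistics import mean, pvariance

def minmax(col):
    """Min-max normalization of one tabular column."""
    lo, hi = min(col), max(col)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in col]

def pipeline(instances, labels, windows, functions, e_u, e_s):
    """Steps 1-4 for d = 1: windowing, extraction, then unsupervised
    and supervised filtering. Returns the selected (window, function) triples."""
    # Steps 1-2: one tabular column per (window, function) pair, normalized.
    cols = {}
    for (x, y) in windows:
        for fname, f in functions.items():
            cols[(x, y, fname)] = minmax([f(inst[x:y + 1]) for inst in instances])
    # Step 3: keep the top e_u fraction by variance (unsupervised score).
    ranked = sorted(cols, key=lambda k: pvariance(cols[k]), reverse=True)
    kept = ranked[:max(1, int(e_u * len(ranked)))]
    # Step 4: keep the top e_s fraction by a simple supervised score
    # (absolute difference of per-class means; a stand-in for the Fisher score).
    def s_score(k):
        a = [v for v, g in zip(cols[k], labels) if g == 0]
        b = [v for v, g in zip(cols[k], labels) if g == 1]
        return abs(mean(a) - mean(b))
    kept = sorted(kept, key=s_score, reverse=True)
    return kept[:max(1, int(e_s * len(kept)))]
```

Each returned triple identifies a window, a feature extraction function, and (implicitly, through the column construction) a variable, i.e. exactly the objects ranked by the two filters.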
It should be noted that the unsupervised and the supervised steps are actually independent of
each other; however, practice suggests applying the unsupervised step first, to eliminate the
triples that most probably do not contain any relevant information, and then the supervised
step (usually, computationally more expensive), to select the triples that most probably contain
information relevant to the problem. Furthermore, the above
schema can be further extended by adding suitable aggregation functions: both unsupervised
and supervised selection can be indeed performed after aggregating the triples either by feature
extraction function, variable, hyperwindow, or a combination of them; in fact, it is convenient
to assume that an aggregation function is always applied, whereas the simplest choice is an
identical aggregation function with no effect. Other examples of aggregation functions include
computing the maximum score or the average score across hyperwindows or across functions.
Aggregation can be applied to the end result as well, in order to return the best variables, the
best functions, and the best hyperwindows.
   Summarizing, the following parameters should be set for a given set of experiments: the
set of feature extraction functions (ℱ), the hyperwindowing parametrization ({𝑟1 , … , 𝑟𝑅 }), the
unsupervised filter (𝑈), the supervised filter (𝑆), and two aggregation functions (𝑈𝑎𝑔𝑔𝑟 , 𝑆𝑎𝑔𝑔𝑟 ),
along with the fraction of selected elements after the unsupervised step (𝐸𝑈 ) and the fraction of
selected elements after the supervised one (𝐸𝑆 ). A schematic representation of our approach
and its parametrization can be found in Fig. 2.


4. Experimental Evaluation

   In order to evaluate the effectiveness of our approach, we considered two cases, in the temporal
(𝑑 = 1) and spatial (𝑑 = 2) settings, respectively. In both cases the datasets are public and had
been used in the past for several other learning experiments.
   In the temporal case, we use a dataset of audio samples, specifically cough recordings,
each labelled as positive or negative for COVID-19 infection [19]. The dataset is composed
of 9986 audio samples recorded by 6613 volunteers, of whom 235 declared to be positive. This
data was originally used to derive smaller datasets, each posing a different form of the same
task of COVID-19 diagnosis depending on the specific set of symptoms and conditions of
the subjects. We refer to these tasks and datasets as TA1, TA2, and TA3. After the original
Figure 3: Summary graph of each feature selection step of the pipeline for each COVID-19 task: a) TA1,
b) TA2, c) TA3. Each graph shows the score assigned to every feature sorted in non-ascending order.
The blue area highlights the 𝐸𝑈 (unsupervised) and 𝐸𝑆 (supervised) selected features. The scores shown
on the y axis are 𝑉 𝑎𝑟 for Variance, MI for Mutual Information and FS for Fisher Score.

                  1        2        3        4        5        6        7        8        9       10       11       12       13
TIME   TA1  FS  1.3e-17  1.3e-17  1.4e-16  1.0e-16  2.0e-16  3.6e-17  3.6e-17  6.6e-15  6.6e-15  2.3e-15  3.5e-16  3.5e-16     –
            MI  4.5e-16  2.2e-13  1.9e-10  5.6e-13  9.3e-15  3.4e-11  3.4e-11  3.4e-13  8.1e-17  1.3e-17  4.5e-15  6.6e-15     –
       TA2  FS  2.6e-9   4.1e-9   3.0e-9   8.2e-9   2.2e-9   2.3e-7   2.3e-7   1.3e-8   2.6e-7   2.6e-7   1.5e-8   5.8e-7      –
            MI  2.2e-9   1.5e-5   3.0e-9   1.4e-6   3.6e-6   2.1e-6   2.0e-7   5.9e-9   2.3e-7   5.6e-7   1.6e-6   5.0e-6      –
       TA3  FS  0.001    0.001    0.001    0.002    0.002    0.002    0.002    0.002    0.002    0.011    0.001    0.002       –
            MI  21.074   0.078    0.001    0.002    0.192    16.728   0.036    0.051    0.183    0.019    0.067    0.024       –
SPACE  IP   FS  9.0e-61  8.9e-61  8.9e-61  9.9e-61  9.0e-61  8.7e-61  1.0e-60  1.0e-60  1.0e-60  1.1e-60  1.1e-60  1.2e-60     –
            MI  1.7e-60  2.5e-59  1.5e-59  5.1e-60  6.6e-60  5.3e-60  4.9e-59  1.8e-60  4.4e-52  6.6e-60  9.3e-50  8.5e-51     –
       PC   FS  3.4e-56  1.8e-56  3.0e-56  2.2e-56  2.7e-56  3.7e-56  2.8e-56  2.8e-56  4.9e-56  3.0e-56  4.2e-56  5.3e-56  4.7e-56
            MI  3.7e-56  4.2e-56  2.7e-56  3.0e-56  1.2e-55  6.2e-56  3.0e-56  2.8e-56  1.2e-55  1.0e-55  4.8e-56  8.6e-56  3.4e-56
       PU   FS  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.3e-58  8.4e-58  8.3e-58  8.4e-58
            MI  3.0e-56  2.3e-57  9.7e-58  1.4e-57  1.2e-57  1.6e-57  1.1e-56  2.9e-56  9.5e-58  1.4e-56  9.2e-58  1.5e-56  9.3e-58
Table 2
𝑝-values of the selected features for COVID-19 (TIME) and LCC (SPACE) tasks.


Figure 4: Summary graph of each feature selection step of the pipeline for each LCC task: a) Indian
Pines, b) Pavia Centre, c) Pavia University. Each graph shows the score assigned to every feature,
sorted in non-ascending order. The blue area highlights the 𝐸𝑈 (unsupervised) and 𝐸𝑆 (supervised)
selected features. The scores shown on the y axis are 𝑉 𝑎𝑟 for Variance, MI for Mutual Information,
and FS for Fisher Score.

publication, these tasks were approached and solved in several different ways, among which

modal decision trees and forests [13]. In this paper, we considered the tasks TA1 (𝑚 = 343),
TA2 (𝑚 = 71), and TA3 (𝑚 = 55). The temporal interpretation of audio signals emerges after
the initial pre-processing via Mel-Frequency Cepstral Coefficients (MFCC) [20]. In short, MFCC
consists of three steps: first, the raw signal is split into small chunks, and a Discrete Fourier
Transform (DFT) is applied to each chunk to produce a local spectrum of the sound; second, the
frequency spectrum is converted to and represented in the mel scale; and third, a set of 𝑛 triangular
band-pass filters is convolved across the frequency spectrum, discretizing it into a finite number
of frequencies. Ultimately, this produces a multivariate time series
representation where the 𝑛 variables describe the power of the different sound frequencies;
in this experiment, we fixed 𝑛 = 30, that is, the variables are 𝐹1 , … , 𝐹30 . For all these tasks
we set 𝑅 = {1, 3, 6} for a total of 7 hyperwindows for each variable. There are several feature
extraction functions that can be applied to temporal data. A commonly used portfolio is the
so-called Catch22 set [21], encompassing 22 selected functions that have shown the ability to
extract relevant information from temporal data in several cases; to this set, we add minimum,
maximum, and mean, leading to a set of 𝑙 = 25 feature extraction functions.
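As a concrete illustration of this extraction step, the following sketch enumerates variable-function-window features for a univariate series. It is a deliberate simplification: equal-length splits, parameterized by a set of split counts, stand in for the paper's hyperwindow construction, and only minimum, maximum, and mean stand in for the full Catch22-based portfolio.

```python
from statistics import mean

def windows(series, split_counts):
    """Split the series into r equal-length chunks for each r in split_counts.
    NOTE: a simplified stand-in for the hyperwindow construction."""
    out = []
    for r in split_counts:
        size = len(series) // r
        for i in range(r):
            out.append(series[i * size:(i + 1) * size])
    return out

# Tiny stand-in for the Catch22 + {min, max, mean} portfolio.
FUNCS = {"min": min, "max": max, "mean": mean}

def extract(series, split_counts):
    """Map each (function name, window index) pair to its feature value."""
    feats = {}
    for w_idx, w in enumerate(windows(series, split_counts)):
        for name, f in FUNCS.items():
            feats[(name, w_idx)] = f(w)
    return feats
```

For instance, a series of length 6 with split counts {1, 3} yields 4 windows and hence 12 candidate function-window features per variable; the real pipeline does this for every variable and a much larger function set.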
   As for the spatial case, we considered three multiclass datasets, that is, Indian Pines (IP),
Pavia University (PU), Pavia Centre (PC), commonly used to benchmark methods for a problem
known as land cover classification (LCC), which typically refers to the task of classifying pixels
in remotely sensed images, associating each pixel with a label that captures the use of the land
it depicts (e.g., forest, urban area, or crop field). Despite being an instance of standard image
segmentation, this problem is usually dealt with as an image classification task. Images of
this kind are usually captured from satellite or aerial sensors, and a single pixel carries the
average electromagnetic response of a wide surface area; as such, the classification of a pixel
in the image typically depends only on a close neighborhood of pixels around it. Additionally,
the imagery is hyperspectral, that is, it encloses a large number of spatial variables (spectral
channels), describing the strength of signals at different electromagnetic wavelengths. In the
case of IP, we considered 1600 5 × 5 images (𝑚 = 1600) with 𝑛 = 200 (in this case, then, the variables
are named 𝐶1 , … , 𝐶200 ), each labelled with one of 16 classes (𝑘 = 16), while in the case of PC
we had 𝑛 = 103 and 𝑘 = 9 (𝑚 = 1600, 𝐶1 , … , 𝐶103 ). In all cases we let 𝑅 = {2} and, in terms of
feature extraction functions, we limited our search to generalized mean, maximum, and minimum,
that is, 𝑙 = 3.
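As a minimal sketch (not the actual implementation), the three spatial extraction functions can be applied to the square neighborhood of a pixel in a single spectral channel as follows; the exponent p of the generalized mean and the neighborhood radius are illustrative parameters:

```python
def generalized_mean(values, p=1.0):
    """Power mean M_p(x) = (mean of x_i^p)^(1/p); p = 1 is the arithmetic mean."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def neighborhood_features(channel, row, col, radius):
    """min, max, and generalized mean of one spectral channel over the
    (2*radius + 1) x (2*radius + 1) window centered at (row, col)."""
    vals = [channel[r][c]
            for r in range(row - radius, row + radius + 1)
            for c in range(col - radius, col + radius + 1)]
    return min(vals), max(vals), generalized_mean(vals)
```

With radius = 1 this covers, e.g., the innermost 3 × 3 window of a 5 × 5 image, the window to which the top-ranked IP feature discussed below applies.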
   For both the temporal and the spatial case we used a variance-based unsupervised filter
(𝑉 𝑎𝑟). This method is based on the idea that features whose variance is too low carry little discriminative information and should be excluded.
Examples of such filters include the Variance Threshold method in the ScikitLearn suite [22].
In this step we let 𝑈𝑎𝑔𝑔𝑟 be the identity function (no aggregation was performed), selecting the
first half of the total number of the variables (𝐸𝑈 = 0.5). In regard to the supervised selection
we applied two different methods for both datasets: Mutual Information (MI) [23] and Fisher
Score (FS) [24]. Depending on the problem and task we used different values for 𝐸𝑆 : for all
temporal dataset tasks we set 𝐸𝑆 = 0.032 (so that the total number of selected features is 12); for
the spatial datasets we used 𝐸𝑆 = 0.02 for IP (the final selection resulting in 12 features) and
𝐸𝑆 = 0.04 for both PC and PU (13).
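The two filter steps can be sketched as follows. This is a minimal stand-in, assuming features stored as columns with class labels, using the Fisher score as the supervised criterion; the actual run relies on the Scikit-learn machinery [22] and the MI and FS scores of [23, 24].

```python
from statistics import mean, pvariance

def variance_filter(columns, keep_frac=0.5):
    """Unsupervised step: indices of the keep_frac highest-variance features."""
    ranked = sorted(range(len(columns)),
                    key=lambda j: pvariance(columns[j]), reverse=True)
    return ranked[:max(1, int(len(columns) * keep_frac))]

def fisher_score(column, labels):
    """FS = sum_k n_k (mu_k - mu)^2 / sum_k n_k var_k, over the classes k."""
    mu = mean(column)
    num = den = 0.0
    for cls in set(labels):
        vals = [x for x, y in zip(column, labels) if y == cls]
        num += len(vals) * (mean(vals) - mu) ** 2
        den += len(vals) * pvariance(vals)
    return num / den if den > 0 else float("inf")

def supervised_filter(columns, labels, kept, k):
    """Supervised step: among surviving features, take the top k by Fisher score."""
    ranked = sorted(kept, key=lambda j: fisher_score(columns[j], labels),
                    reverse=True)
    return ranked[:k]
```

Note that the two steps can disagree: a feature with high variance may still score poorly under FS if its spread is not aligned with the class labels, which is precisely why the supervised step follows the unsupervised one.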
   All results are summarized in graphical form in Fig. 3 and 4. In both cases the scores in the
unsupervised step decrease quite steeply well before reaching the first half of the set of analyzed
features. This indicates that excluding as much as 50% of the features is quite effective, and
that this fraction could be increased. After the second step, the value of the selected features
can only be estimated: one way to do so is to use each feature in a pairwise, non-parametric
hypothesis test (validation step). In the temporal case, the test is binary; in the spatial, multiclass
case, instead, we performed one-versus-all tests to identify the class that is best distinguished
from the other ones, and its corresponding 𝑝-value. It is crucial to observe that a hypothesis
test could not be used as a selection criterion in any preceding step: given its probabilistic
nature, performing too many tests renders the results unreliable [18]. This problem persists
during validation, though to a lesser extent; to compensate for it, we applied the Bonferroni
method for 𝑝-value correction. The corrected 𝑝-values are shown in Tab. 2. As can be seen,
most of the selected features, but not all, passed the test with confidence greater than 95%,
and in most of those cases greater than 99%. The fact that not all the selected features are
validated indicates that the choice is not trivial, and our approach was able to select a few very
informative variable-function-hyperwindow triples from sets of between 600 and 7500. By way
of example, the first selection for TA1 in the case of FS corresponds to the total power in lowest
fifth of frequencies in the Fourier power spectrum (from Catch22) applied to 𝐹2 on the whole series,
while for IP corresponds to the mean value of 𝐶3 applied to the innermost 3 × 3 window.
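The Bonferroni step itself is simple enough to state in a few lines; this sketch assumes a list of raw 𝑝-values from the per-feature tests (the values shown here are hypothetical), and caps the corrected values at 1, although some of the values reported in Tab. 2 exceed 1.

```python
def bonferroni(p_values):
    """Bonferroni correction: scale each p-value by the number of tests,
    capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical raw p-values for three selected features.
raw = [0.004, 0.020, 0.300]
corrected = bonferroni(raw)
# A feature passes at the 95% confidence level when its corrected
# p-value falls below 0.05.
passed = [p < 0.05 for p in corrected]
```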


5. Conclusions
Feature engineering and selection are very well-known problems. Most, if not all, of the existing
work, however, has been confined to the case of adimensional, tabular data. In this paper, we approached
feature engineering and selection for dimensional data. Our proposal is inspired by recent
results regarding symbolic learning from dimensional data, but it can be applied independently
of any learning method as a way to study dimensional datasets. Conceived as a combination of
known techniques, our framework is capable of identifying the most informative variables, the
most informative feature extraction functions, and the most informative hyperwindows in a
dimensional dataset of any number of dimensions. We performed a series of experiments in the
case of two and three dimensions, obtaining quite encouraging results.
Our implementation, written in Julia, is open-source and is part of a larger project that
includes symbolic learning techniques, feature selection, and post-hoc interpretation of symbolic
models for non-tabular data.


Acknowledgments
We acknowledge the support of the project Modal Geometric Symbolic Learning, funded by the
University of Ferrara under the FIRD program.


References
 [1] H. M. Abdulwahab, S. Ajitha, M. A. N. Saif, Feature selection techniques in the context of
     big data: taxonomy and analysis, Applied Intelligence 52 (2022) 13568–13613.
 [2] J. Cai, J. Luo, S. Wang, S. Yang, Feature selection in machine learning: A new perspective,
     Neurocomputing 300 (2018) 70–79.
 [3] S. Solorio-Fernández, J. Carrasco-Ochoa, J. Martínez-Trinidad, A systematic evaluation
     of filter unsupervised feature selection methods, Expert Systems with Applications 162
     (2020) 1–26.
 [4] S. Huang, Supervised feature selection: A tutorial, Artificial Intelligence Research 4 (2015)
     22–37.
 [5] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical
     Engineering 40 (2014) 16–28.
 [6] A. Gandomi, M. Haider, Beyond the hype: Big data concepts, methods, and analytics,
     International Journal of Information Management 35 (2015) 137–144.
 [7] J. Rodríguez Diez, C. González, H. Boström, Boosting interval based literals, Intelligent
     Data Analysis 5 (2001) 245–262.
 [8] Y. Yamada, E. Suzuki, H. Yokoi, K. Takabayashi, Decision-Tree Induction from Time-Series
     Data Based on a Standard-Example Split Test, in: Proceedings of the 12th International
     Conference on Machine Learning (ICML), 2003, p. 840–847.
 [9] A. Brunello, E. Marzano, A. Montanari, G. Sciavicco, J48SS: A novel decision tree approach
     for the handling of sequential and time series data, Computers 8 (2019) 21.
[10] S. Jha, A. Tiwari, S. A. Seshia, T. Sahai, N. Shankar, TeLEx: learning signal temporal logic
     from positive examples using tightness metric, Formal Methods in System Design 54 (2019)
     364–387.
[11] X. Li, C. Claramunt, A spatial entropy-based decision tree for classification of geographical
     information, Transactions in GIS 10 (2006) 451–467.
[12] D. Della Monica, G. Pagliarini, G. Sciavicco, I. Stan, Decision trees with a modal flavor, in:
Proc. of the 21st International Conference of the Italian Association for Artificial Intelligence (AIxIA),
     volume 13796 of LNCS, Springer, 2022, pp. 47–59.
[13] F. Manzella, G. Pagliarini, G. Sciavicco, I. Stan, The voice of COVID-19: breath and
cough recording classification with temporal decision trees and random forests, Artificial
     Intelligence in Medicine 137 (2023) 1–14.
[14] G. Pagliarini, G. Sciavicco, Decision tree learning with spatial modal logics, in: Proc. of
     the 12th International Symposium on Games, Automata, Logics, and Formal Verification
     (GandALF), volume 346 of EPTCS, 2021, pp. 273–290.
[15] J. Halpern, Y. Shoham, A Propositional Modal Logic of Time Intervals, Journal of the ACM
     38 (1991) 935–962.
[16] P. Balbiani, J. Condotta, L. Fariñas del Cerro, A model for reasoning about bidimensional
     temporal relations, in: Proc. of the 6th International Conference on Principles of Knowledge
     Representation and Reasoning (KR’98), Morgan Kaufmann, 1998, pp. 124–130.
[17] G. Heinze, D. Dunkler, Five myths about variable selection, Transplant International (2017)
     6–10.
[18] M. Jafari, N. Ansari-Pour, Why, when and how to adjust your p values?, Cell 20 (2019)
     604–607.
[19] C. Brown, J. Chauhan, A. Grammenos, J. Han, A. Hasthanasombat, D. Spathis, T. Xia,
     P. Cicuta, C. Mascolo, Exploring automatic diagnosis of COVID-19 from crowdsourced
     respiratory sound data, in: Proc. of the 26th ACM SIGKDD International Conference on
     Knowledge Discovery and Data Mining (KDD), 2020, pp. 3474–3484.
[20] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic
     word recognition in continuously spoken sentences, IEEE Transactions on Acoustics,
     Speech and Signal Processing 28 (1980) 357–366.
[21] C. Lubba, S. Sethi, P. Knaute, S. Schultz, B. Fulcher, N. Jones, Catch22: Canonical time-
     series characteristics - selected through highly comparative time-series analysis, Data
     Mining and Knowledge Discovery 33 (2019) 1821–1852.
F. Pedregosa, et al., Scikit-learn: Machine learning in Python, Journal of Machine Learning
     Research 12 (2011) 2825–2830.
[23] A. Kraskov, H. Stögbauer, P. Grassberger, Estimating mutual information, Physical review
     E 69 (2004) 066138.
[24] R. Duda, P. Hart, D. Stork, Pattern Classification, Wiley, 2012.