The Dilemma Between Data Transformations and Adversarial Robustness for Time Series Application Systems

Sheila Alemany, Niki Pissinou
School of Computing and Information Sciences, Florida International University
salem010@fiu.edu, pissinou@fiu.edu

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Adversarial examples, or nearly indistinguishable inputs created by an attacker, significantly reduce machine learning accuracy. Theoretical evidence has shown that the high intrinsic dimensionality of datasets facilitates an adversary's ability to develop effective adversarial examples in classification models. Adjacently, the presentation of data to a learning model impacts its performance. For example, we have seen this through dimensionality reduction techniques used to aid with the generalization of features in machine learning applications. Thus, data transformation techniques go hand-in-hand with state-of-the-art learning models in decision-making applications such as intelligent medical or military systems. With this work, we explore how data transformation techniques such as feature selection, dimensionality reduction, or trend extraction may impact an adversary's ability to create effective adversarial samples on a recurrent neural network. Specifically, we analyze this from the perspective of the data manifold and the presentation of its intrinsic features. Our evaluation empirically shows that feature selection and trend extraction techniques may increase the RNN's vulnerability. A data transformation technique reduces the vulnerability to adversarial examples only if it approximates the dataset's intrinsic dimension, minimizes codimension, and maintains higher manifold coverage.

1 Introduction

As the application of ML grows in industries that require explainable and reliable ML models, there is significant concern about the immense fragility of neural networks when given a varying-size set of imperceptibly perturbed inputs, adversarial examples (Biggio and Roli 2018; Su, Vargas, and Sakurai 2019; Elsayed et al. 2018). To address this issue, many pioneering works have focused on solutions that increase the models' robustness to maintain high accuracy assuming the existence of these adversarial examples (Biggio and Roli 2018; Ilyas et al. 2019; Goodfellow, McDaniel, and Papernot 2018; Hendrycks et al. 2021). The solutions proposed in these works have observed adversarial examples from the perspective of the abstractions created by the machine learning models. But, since these datasets are incomplete and instantaneous representations of information, trained machine learning models contain many areas of low confidence. These low-confidence areas of knowledge can be mapped similarly to how a human can be less sure of a correct answer for unfamiliar contexts. Adversaries exploit these low-confidence areas and create the smallest input change possible to skew the model's recommendations or decisions to be wrong or inaccurate. Despite these observations (Ilyas et al. 2019; Goodfellow, McDaniel, and Papernot 2018), the existence of adversarial examples remains an open problem (Shafahi et al. 2019; Hendrycks et al. 2021). However, the proposed theories continuously approach similar conclusions: the vulnerability of ML models is highly correlated with how the data is represented.

In practice, data is repeatedly being transformed with a growing list of pre-processing techniques to optimize ML models (Aleman et al. 2018; Naranjo and Santos 2019; Huang and Zhou 2019), and these techniques transform the way data is presented to an intelligent system. Thus, based on existing work, we hypothesize that data transformations may directly impact the adversary's ability to create adversarial samples due to manipulations in representing the intrinsic features of data. Motivated by the direct impact that this may have on currently deployed systems, we explore how five widely-applied data transformation techniques affect the robustness [1] of recurrent neural networks.

[1] In this work, robustness refers to the adversary's decreased capacity to attack more efficiently or induce inaccurate results using "harder-to-detect" perturbations.
We consider techniques that span three data transformation categories: dimensionality reduction (principal component analysis (Shlens 2014)), feature selection (random forest (Golay and Kanevski 2017) and low variance (Bramer and Devedic 2004)), and trend extraction (candlestick charting (Chmielewski et al. 2015) and exponential moving average (Klinker 2011)). Our empirical evaluation aims to identify whether data transformation techniques in the three categories can impact the efficiency of an adversarial attack. To better understand this, we design our experiments to explore the following questions:

1. Could data transformations contribute to an adversary's ability to more easily construct adversarial examples (i.e., make the ML model more vulnerable to attacks)?

2. Is the dimensionality reduction technique, PCA, consistent as a strategy to increase robustness, as seen in Bhagoji et al. (2018), when given a time series dataset, a recurrent neural network, and varying numbers of selected principal components?

3. What representations of data contribute to ML models that are least susceptible to adversarial examples, and how can we use them to ensure best practices when manipulating data?

Overall, in this work, we expand the empirical understanding of how data transformation techniques may impact the robustness of a recurrent neural network given the Carlini & Wagner (Carlini and Wagner 2017b) evasion attack on a multi-variate time series dataset (Banos et al. 2015). This benefits ML practitioners as they can use the presented results to move towards better data practices when manipulating data increasingly used in deployed intelligent systems. To the best of our knowledge, this is the first work exploring whether certain data transformations (outside of dimensionality reduction) may impact robustness in time series ML models.
2 Related Work

Many pioneering works have established a foundation for the seemingly inherent vulnerability to adversarial examples. Szegedy et al. (2014) argued the existence of low-probability adversarial "pockets" that an adversary can take advantage of. Feinman et al. (2017) established that adversarial samples lie furthest away from the data manifold [2] and are restricted in the direction normal to the data manifold such that the adversarial examples cross the decision axis (the optimal boundary between the data manifolds captured during model training time) and result in an incorrect output (Khoury and Hadfield-Menell 2018).

Shafahi et al. (2019) and Ilyas et al. (2019) proposed that the vulnerabilities to adversarial examples stem from the foundational assumption in ML that the training data accurately and adequately represents the underlying, abstracted phenomena through the learning process. Such high-dimensional abstractions [3] allow adversaries to exploit minor and specific details that a trained ML model can overlook. Similarly, Amsaleg et al. (2020) showed that the intrinsic dimensionality of datasets and an adversary's ability to develop effective adversarial examples are directly proportional in classification models, since a higher intrinsic dimensionality results in higher model complexity. In all cases, the quality of the abstractions is limited by how the data is presented to the model (i.e., does the data have bias? Is it missing values? Does it contain noise?). This is because ML learning/generalization and adversarial example creation remain a classic optimization problem.

Data dimensionality has been referred to as a "curse" due to the substantial computational complexity and the difficulties it yields when abstracting properties in data that do not occur in lower-dimensional data (Van Der Maaten, Postma, and Van den Herik 2009; Ilyas et al. 2019; Bhagoji et al. 2018). As a result, data transformation techniques are often used in learning systems to improve upon these burdens (Cheng and Lu 2018). Naturally, data transformations have influenced the field of adversarial ML due to the connection between adversarial vulnerability in deep learning and the high dimensionality of data. These techniques increase robustness by modifying the input such that the impact of gradient-based attacks is reduced, either through adversarial pre-training (Hendrycks, Lee, and Mazeika 2019), feature squeezing (Xu, Evans, and Qi 2018), dimensionality reduction with PCA (Bhagoji et al. 2018), or identifying and removing the least "robust features" which contribute the most to a model's vulnerability (Ilyas et al. 2019). Thus, they are defenses that focus on executing certain transformations at the beginning of the ML pipeline, such that even when the adversary gains perfect knowledge of the trained model, it is more difficult for the adversary to optimize its attack.

Carlini and Wagner (2017a) showed how certain previously described techniques, including (Bhagoji et al. 2018), were not a consistent defense. For example, they were able to show how using PCA on the training data did not increase the robustness of a convolutional neural network, only the fully-connected network. Other works had inconsistencies in their presented results when tested on other datasets. Observing these inconsistencies and how the representation of data highly influences abstractions, we hypothesize that different data transformations may individually impact the representation of the intrinsic features and hence uniquely impact an adversary's ability to attack the model.

[2] The data manifold is defined as the geometry of the data, which contains a topological space that locally resembles the Euclidean space near each data value.
[3] The high dimensionality of a model is not only correlated to the model architecture/parameters but also the dataset being used (Su, Vargas, and Sakurai 2019).
3 Data Transformation Techniques

Our comparative review includes data transformation techniques applied during the pre-processing stage of the ML pipeline. It is not exhaustive. We have strictly focused on linear data transformation techniques that have been commonly used in a variety of applications (Aleman et al. 2018; Bhagoji et al. 2018; Carlini and Wagner 2017a). For brevity, we assume the reader understands each technique. Future work can focus on non-linear dimensionality reduction techniques; we keep the two directions separate as non-linear transformations may impact the complexity of data manifolds differently than linear ones.

Dimensionality Reduction
Dimensionality reduction is the transformation of high-dimensional data into a significant representation of low dimensionality (Cheng and Lu 2018). Principal component analysis (PCA) is by far one of the more popular unsupervised tools due to its simple, non-parametric method for extracting relevant information from overwhelming datasets (Shlens 2014). For this work, we consider using 27%, 50%, and 81% of the principal components to approximate the feature counts around the 25th, 50th, and 75th percentiles. We explore in Section 6 how the selected principal components in varying extremes can significantly change the data manifold in ways which impact robustness.
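As a concrete reference point, the sketch below shows how such a projection might be implemented with scikit-learn (an assumption; the paper does not name its PCA implementation). The function and array names are placeholders, and the inputs are assumed to be standardized samples over the 22 MHealth features.

```python
# Minimal sketch (not the authors' code): project the input features onto a fixed
# fraction of principal components with scikit-learn. X_train/X_test are assumed to
# be standardized arrays of shape (n_samples, n_features).
import numpy as np
from sklearn.decomposition import PCA

def pca_transform(X_train, X_test, fraction=0.50):
    """Keep round(fraction * n_features) principal components (e.g., 0.27, 0.50, 0.81)."""
    n_components = max(1, round(fraction * X_train.shape[1]))
    pca = PCA(n_components=n_components)
    X_train_low = pca.fit_transform(X_train)   # fit the projection on training data only
    X_test_low = pca.transform(X_test)         # reuse the same projection at test time
    print(f"{n_components} components explain "
          f"{pca.explained_variance_ratio_.sum():.2%} of the total variance")
    return X_train_low, X_test_low
```

With 22 features, fractions of 0.27, 0.50, and 0.81 retain roughly 6, 11, and 18 components, which matches the feature counts reported in Table 1.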
Feature Selection
Feature selection is a data transformation technique that has been used for decades to represent particular relationships in data by eliminating features that may be irrelevant or redundant (Dash and Liu 1997) based on a varying set of heuristics. These techniques differ from dimensionality reduction methods in that they do not map the data onto a lower-dimensional space. For this work, we have selected random forest selection (Golay and Kanevski 2017) and low variance selection (Bramer and Devedic 2004) due to their widespread use and low computational requirements. For random forest selection, we set the feature importance threshold to the mean of all importance values, as is standard in practice (Golay and Kanevski 2017). For low variance selection, the selected features contributed 91.1% of the total variance in the data, as this is said to be the best heuristic to approximate the most significant information of a dataset (Van Der Maaten, Postma, and Van den Herik 2009). Although random forest selection considers the relationship of features with the target variable and low variance selection does not, both techniques chose 9 overlapping features. Thus, we expect their impact on data manifolds to be similar even with their varying heuristics for feature selection.
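A minimal sketch of the two heuristics as described above, assuming a scikit-learn random forest and a variance-coverage rule; the helper names and estimator settings are illustrative, not the authors' code.

```python
# Minimal sketch (assumed scikit-learn implementation) of the two selection heuristics:
# (1) random forest importance with the mean importance as threshold, and
# (2) low-variance selection keeping the highest-variance features that jointly
#     account for ~91% of the total variance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def random_forest_select(X, y):
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    importances = forest.feature_importances_
    return np.where(importances >= importances.mean())[0]   # keep above-mean features

def low_variance_select(X, coverage=0.911):
    variances = X.var(axis=0)
    order = np.argsort(variances)[::-1]                      # highest variance first
    cumulative = np.cumsum(variances[order]) / variances.sum()
    k = int(np.searchsorted(cumulative, coverage)) + 1       # smallest set reaching coverage
    return np.sort(order[:k])
```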
Trend Extraction
Recent works concerning robustness have focused on image recognition tasks, but time series data is also highly used in ML applications. As a result, we have analyzed the impact of data transformation techniques meant to extract trends in time series data, such as candlestick charting (Chmielewski et al. 2015) and the exponential moving average (EMA) (Klinker 2011).

These techniques were selected as they are used in prediction tasks in areas such as financial markets (Naranjo and Santos 2019), IoT (Aleman et al. 2018), and object tracking (Huang and Zhou 2019). They affect the data manifold by smoothing the trends in time series data, similarly to feature squeezing for image recognition (Xu, Evans, and Qi 2018), by artificially reducing the distance between temporally adjacent points, which provides a better estimation of their distance along the manifold. For this work, to ensure we compare both trend extraction techniques fairly, both were assigned the same value for the time window. The time window value of 20 was a selected hyperparameter that would not reduce the dimensionality of the dataset enough to hinder the model accuracy for the candlestick charting technique but would cause a significant enough change to the feature trends given the EMA technique.
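A minimal sketch, assuming pandas and an integer-positional index, of how the two transforms might be applied with a shared window of 20. The function names are illustrative, and the candlestick aggregation (first/max/min/last per window and channel) is our reading of the four-tuple described above.

```python
# Minimal sketch (not the authors' code) of the two trend extraction transforms with a
# shared time window of 20: an exponential moving average applied per sensor channel,
# and candlestick charting that collapses each window of 20 readings into an
# (open, high, low, close) four-tuple per channel.
import pandas as pd

WINDOW = 20  # same window for both techniques, as in the evaluation above

def ema_smooth(df: pd.DataFrame) -> pd.DataFrame:
    """Smooth every channel with an EMA parameterized by the shared window length."""
    return df.ewm(span=WINDOW, adjust=False).mean()

def candlestick(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate consecutive windows of 20 timestamps into OHLC tuples per channel.

    Assumes df has a simple integer RangeIndex over timestamps.
    """
    grouped = df.groupby(df.index // WINDOW)
    ohlc = pd.concat(
        {"open": grouped.first(), "high": grouped.max(),
         "low": grouped.min(), "close": grouped.last()},
        axis=1,
    )
    return ohlc  # columns become (statistic, channel) pairs
```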
4 Threat Model

As per Carlini et al. (2019), we define the adversary's knowledge, capabilities, and goals to ensure analysis for worst-case robustness. We did not implement any additional defenses, as our goal for this work is to explore the impact of these techniques for small perturbation budgets that are difficult to detect using the current state-of-the-art defenses (Tjeng, Xiao, and Tedrake 2019). Considering the attack success rate with incorporated defenses and data transformation techniques is left for future work.

Knowledge
We use a white-box attack where the adversary has full access to the trained neural network model, the defense used, and the data distribution at test time. We consider this attack because white-box attacks are more powerful than black-box attacks, as a white-box attack can reach a 100% success rate. Additionally, we consider evasion attacks where the adversaries can attack only during model deployment, meaning that they tamper with the input data after the deep learning model is trained.

Capabilities
For the attack method, we use the iterative optimization-based method of Carlini and Wagner (2017b). We selected this attack model due to its high success at crafting effective adversarial samples with the lowest distortion (Carlini and Wagner 2017b). Specifically, we have used the Carlini & Wagner l∞ implementation from the Adversarial Robustness Toolbox by IBM Research (Nicolae et al. 2018). The minor hyperparameters modified to create adversarial attacks that reduced the accuracy of our model were the learning rate and confidence, set to 0.01 and 0.5, respectively.

Goal
To create effective adversarial examples, we use the l∞ distortion metric to measure the similarity between the benign and potential adversarial examples, since the l∞-ball around each data point has recently been studied as an optimal, natural notion for adversarial perturbations (Goodfellow, Shlens, and Szegedy 2014; Carlini and Wagner 2017b). For this work, we used the untargeted attack and considered 0 < ϵ ≤ 1 (Tjeng, Xiao, and Tedrake 2019). Although targeted attacks are more powerful concerning the attack success rate, we are considering an untargeted attack since these attacks require a more limited perturbation budget, which allows an adversary to efficiently deploy the attack undetected (Carlini and Wagner 2017b). We can visualize the perturbation under this distance metric by viewing a series of data points. There is a maximum perturbation budget of ϵ: each value is allowed to be changed by at most ϵ, with no limit on the number of modified values. Since the perturbation budget has to remain less than some small ϵ, even if all values are modified, the trends in time series data will appear visually identical.
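The sketch below illustrates how an attack of this kind could be run with the Adversarial Robustness Toolbox. It is not the authors' script: `model` and `x_test` are placeholders, and argument names may differ across ART versions.

```python
# Minimal sketch (assumptions noted above) of the Carlini & Wagner l-infinity attack from
# IBM's Adversarial Robustness Toolbox against a trained Keras model, with the learning
# rate and confidence values reported in the Capabilities paragraph.
import numpy as np
from art.estimators.classification import KerasClassifier
from art.attacks.evasion import CarliniLInfMethod

classifier = KerasClassifier(model=model, clip_values=(0.0, 1.0))   # `model` = trained RNN
attack = CarliniLInfMethod(classifier, confidence=0.5, learning_rate=0.01)

x_adv = attack.generate(x=x_test)                                   # untargeted evasion attack

# Per-sample l-infinity distortion between benign and adversarial inputs.
linf = np.max(np.abs(x_adv - x_test), axis=tuple(range(1, x_test.ndim)))
print("mean l_inf distortion:", linf.mean())
```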
5 Experimental Methods

We compare our evaluation results with previous works that have completed similar tests with the computer vision datasets CIFAR-10 (Krizhevsky 2009) and MNIST (LeCun, Cortes, and Burges 2010) to check for overall consistency in the impact of data transformation techniques.

Dataset
The focus of related adversarial evaluation is largely centered around image recognition tasks. However, there are high dimensional time series datasets that have received little attention in the adversarial ML field, and the need for evaluation on other datasets is crucial for the advancement of the area (Carlini and Wagner 2017a). As a result, we have used the MHealth (Mobile Health) Dataset [4], which contains body motion and vital signs recordings of individuals while performing several physical activities (Banos et al. 2015). This highly volatile dataset contains 22 total features which map to one of 12 potential physical activities, and we selected the data corresponding to subject1 with a total of 160,860 timestamps.

[4] Dataset available on the UCI ML Repository at https://archive.ics.uci.edu/ml/datasets/MHEALTH+Dataset

[Figure 1: three T-SNE scatter plots, (a) MHealth Dataset, (b) MNIST Dataset, (c) CIFAR-10 Dataset, each plotting Component 1 against Component 2 with points colored by class.]
Figure 1: Visualization of datasets using T-SNE to observe the relationships between the points in high-dimensional space using 1000 randomly selected points from each dataset. MHealth shows that various clusters can be easily identified, such as the points in classes 1, 2, and 3, similar to MNIST. Yet, there are clusters such as for classes 8 and 12, where the points are more scattered, similar to CIFAR-10.

Figure 1b shows that the MNIST dataset contains the most well-defined classes, meaning points corresponding to the same class are clustered together more frequently. This implies that the points within each class of the MNIST dataset have highly correlated relationships even in this high-dimensional dataset. On the contrary, in Figure 1c, the CIFAR-10 dataset does not have well-defined clusters, resulting in an almost opposite conclusion relative to the MNIST dataset. As a result, CIFAR-10 has been described as a substantially more difficult dataset to work with. Therefore, conclusions made with MNIST may contain properties that do not generalize across tougher datasets such as CIFAR-10 (Carlini and Wagner 2017a). However, the MHealth dataset lies between the MNIST and CIFAR-10 datasets in regards to the relationship between the points in high-dimensional space. Thus, we are testing with a realistic time series dataset that contains manifold properties that may carry over to various other high-dimensional time series datasets. As a result, we believe our evaluation using the MHealth dataset is a valid example that brings to light the observations presented in this work.
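For reference, a Figure 1-style panel can be produced with a sketch like the following (not the authors' plotting code; scikit-learn's t-SNE and matplotlib are assumed, and the function name is a placeholder):

```python
# Minimal sketch: t-SNE on 1000 randomly sampled points, colored by class label, as in
# the Figure 1 panels described above.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(X, y, n_points=1000, seed=0):
    idx = np.random.default_rng(seed).choice(len(X), size=n_points, replace=False)
    emb = TSNE(n_components=2, random_state=seed).fit_transform(X[idx])
    plt.scatter(emb[:, 0], emb[:, 1], c=y[idx], cmap="tab20", s=5)
    plt.xlabel("Component 2")
    plt.ylabel("Component 1")
    plt.colorbar(label="class")
    plt.show()
```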
Learning Model
Data pre-processing includes processes such as data cleaning, normalization, transformation, feature extraction, and selection, and it is the step done before training in this work. For the learning model, we have implemented a multi-class classification recurrent neural network (RNN) with LSTM layers using Keras (Chollet et al. 2015). Network architecture and hyperparameter tuning were completed to guarantee that all trained models for each data transformation technique received the same hyperparameters while maintaining testing accuracy above 90%, ensuring that the network architecture did not influence robustness results. The network contained only two LSTM units combined with dropout layers, which returned satisfactory training and testing results. We used the hyperbolic tangent function in these hidden vectors as it is a standard activation function among recurrent neural networks (Chollet et al. 2015). The dropout values were set to 0.1, meaning that 10% of each input was ignored to prevent the model from overfitting to the training data. Lastly, we are not concerned about our network's simple structure because simple network architectures have been shown not to hinder the Carlini & Wagner evasion attack (Carlini and Wagner 2017b).
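A minimal Keras sketch consistent with this description; the paper does not report layer widths, so the sizes below are placeholders, and we read "two LSTM units" as two stacked LSTM layers, each followed by dropout.

```python
# Minimal sketch (assumptions noted above) of the kind of Keras RNN described in the
# Learning Model paragraph: tanh activations, dropout of 0.1, and a softmax over the
# 12 MHealth activities.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

def build_rnn(timesteps, n_features, n_classes=12):
    model = Sequential([
        LSTM(64, activation="tanh", return_sequences=True,
             input_shape=(timesteps, n_features)),
        Dropout(0.1),                 # ignore 10% of activations to limit overfitting
        LSTM(32, activation="tanh"),
        Dropout(0.1),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```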
6 Robustness Against Evasion Attacks

Since the data manifold structure heavily influences the existence of adversarial examples and how these adversarial attacks are optimized, we observe the changes in model performance from the perspective of the data manifold. To compare the changes made to the manifold by the data transformations, we observe the codimension, or the difference between the dimension of the data manifold and the dimension of the embedding space [5] (Khoury and Hadfield-Menell 2018). We show only perturbation budgets from 0 to 1 to show the impact of small perturbations, since the concluding results do not change as the attack success continues to increase.

[5] The embedding space is the space in which the data is embedded after dimensionality reduction.

[Figure 2: two line plots of attack success with l∞ (%) and log loss against the perturbation budget ϵ for the baseline, PCA 27%/50%/81%, random forest, low variance, candlestick, and EMA models.]
Figure 2: Attack success and log loss scores given five data transformation techniques against the baseline model without pre-processing. We can see that the best performing technique was PCA using half of the principal components. However, the log loss scores, which correspond to model confidence, show that all PCA techniques returned the lowest confidence when ϵ > 0.57.

[Figure 3: a line plot of precision scores against the perturbation budget ϵ and a scatter plot of precision at ϵ = 0.8 against feature count for the same models.]
Figure 3: Precision scores under-performed for all techniques once the perturbation budget was over ϵ = 0.68. From the scatter plot, we can see that reducing the number of features during training negatively impacted the precision scores given a high enough perturbation budget.

Manifold Impacts on Log Loss & Precision
From Figure 3, we can see that precision is consistently below baseline for both feature selection and trend extraction techniques. The combination of low log loss and low precision indicates that these models are overly confident but erroneous, implying a closer proximity between the submanifolds and the decision axis (Wu et al. 2017). In other words, when the submanifolds are closer to the decision boundary, the distance between two arbitrary points in different classes is relatively lower. Thus, when an ML model is tasked with categorizing a new point, it will often confidently miscategorize it since it is "harder" to differentiate between the two candidate classes. From the perspective of an adversary, they now require a minimal perturbation budget to "convince" the ML model to consistently miscategorize incoming data points with high confidence. However, this is not the case with PCA. With PCA, the precision is improved when ϵ < 0.65 due to relatively better defined submanifolds as a direct result of mapping the input embedding into a lower dimension. The reduced precision for greater values of epsilon is then introduced when the log loss of the model increases, because linear units can lose precision by responding too strongly, with reduced confidence, to samples with larger perturbations that the model does not understand (Goodfellow, Shlens, and Szegedy 2014).

Takeaway 1.1: PCA creates more well-defined submanifolds for each class such that it is more difficult for an adversary to "trick" an ML model with an imperceptible adversarial example. This is not the case for feature selection and trend extraction techniques.

Manifold Impacts on Model Accuracy
From Figure 2, it is clear the attack success rate is only hindered by 24.39% when the PCA technique is used with half of its principal components. Bhagoji et al. (2018) proposed that PCA should consistently increase robustness because projecting onto the highest-variance principal components should eliminate the low-variance features that adversaries can easily take advantage of. However, as Carlini and Wagner (2017a) already showed, this is not consistent for a convolutional neural network, and in our evaluation it is not always consistent for our recurrent neural network either.

Table 1: Summary of results. Columns from left to right present the data transformation technique, the number of features used from the original data, its clean accuracy when the model is not under attack, the perturbation budget (l∞ distance) required for the attack success rate to reach 30%, and the percentage change in robustness at ϵ = 0.80 relative to the baseline model with no data transformation applied to its training data.

Data Transformation | Feature Count | Benign Accuracy | Distance (l∞) | ∆ in Robustness
Baseline            | 22            | 97.93%          | 0.51          | -
PCA 50%             | 11            | 96.71%          | 0.40          | ↑ 24.39%
PCA 81%             | 18            | 98.80%          | 0.76          | ↓ 43.90%
PCA 27%             | 6             | 95.00%          | 0.34          | ↓ 60.98%
Random Forest       | 9             | 96.11%          | 0.13          | ↓ 31.71%
Low Variance        | 11            | 91.32%          | 0.15          | ↓ 65.85%
Candlesticks        | 22            | 92.78%          | 0.11          | ↓ 60.98%
EMA                 | 22            | 96.48%          | 0.51          | ↓ 7.32%

The other PCA techniques, using 27% and 81% of the principal components, did not perform as well once the perturbation budget exceeded ϵ = 0.1. In particular, using only 27% of the principal components results in losing too many dimensions, which can in turn reduce the manifold coverage for the dataset. This lack of coverage makes it much easier for an adversary to find an example far away from the data manifold (Feinman et al. 2017). This can happen easily in practice since high training/testing accuracy does not imply high accuracy/coverage of the data manifold (Khoury and Hadfield-Menell 2018). On the other hand, when using 81% of the principal components, there is high codimension, resulting in relatively more directions normal to the manifold and directly contributing to a more efficient attack. Thus, we can conclude that an optimal codimension exists in datasets such that the presented vulnerabilities are minimized.

Takeaway 2.1: The dimensionality reduction technique, PCA, is not a consistent defense against adversarial examples when the codimension is not optimal.

The feature selection techniques behaved similarly (as expected) given that both techniques selected a majority of the same features. In both cases, since no mapping to a lower dimension occurs and a majority of the features are removed, the model contains high codimension and a lack of manifold coverage relative to the dimensionality reduction. As a result, feature selection aids an efficient adversarial attack through all tested perturbation budgets.

Takeaway 2.2: Feature selection techniques contribute to higher codimension and lack manifold coverage, resulting in an adversary's ability to construct adversarial examples more easily.

The trend extraction techniques, however, do not remove the features used but manage to force the data into a lower dimensional manifold by generalizing the trends that normally contribute to the high dimensionality in trained models (Xu, Evans, and Qi 2018). For the candlestick charting, the transformation into the four-tuple reshaped the features but contributed to fundamental information loss for the dataset. The information loss resulted in codimension on the higher relative end and one of the most efficient creations of adversarial examples, with a 60.98% decrease in robustness at ϵ = 1.0. However, EMA did not seem to smooth the manifold enough for a drastic change from the baseline data. Therefore, with no statistically significant change to the data manifold, it results in a performance on par with the baseline.

Takeaway 2.3: Candlestick charting contributes to the most vulnerable ML models due to information loss, which significantly increases codimension.
Optimal Data Representations
From our experimentation, we were able to see that the data transformation techniques which did not minimize codimension allowed pathways for adversaries to exploit. The difficulty arises because transformations do not always and consistently impact the codimension. This prompted us to ask the following question: how do we know which transformation to execute, and how, to ensure that the codimension is not increased for an arbitrary dataset?

Reaching this ideal data representation can be done by identifying the intrinsic dimension of a dataset. The intrinsic dimension is defined with respect to the codimension of the solution set (Li et al. 2018). In other words, it can be described as the minimum number of parameters necessary to account for the observed properties in the data, achieve optimal ML performance, and reduce codimension.

Takeaway 3.1: ML practitioners can reduce codimension in their models using the intrinsic dimension of their dataset.

Finding and Using Intrinsic Dimension
The geometry of the data manifold, or the dataset's intrinsic dimensionality, is generally twisted and curved with non-uniformly distributed points, making identifying the intrinsic dimensionality a challenging task unique to each dataset (Facco et al. 2017). There are various tools and algorithms to analyze intrinsic characteristics such as the intrinsic dimensionality of data. For example, the most straightforward way is to count the number of features that contribute at least 90% of the total variance (Van Der Maaten, Postma, and Van den Herik 2009). For datasets and ML models that are more complex, Li et al. (2018) proposed to measure the intrinsic dimension of an "objective landscape," or the dimension of the subspace of a parameterized model, such as a dataset or neural network. They do so by training a neural network from a small, randomly oriented subspace and slowly increasing its dimension (through added features or parameters) until they reach a plateau of performance accuracy, and they define that configuration to be the objective landscape's intrinsic dimension.

To measure the intrinsic dimension of the MHealth dataset, we used both of these techniques, (Van Der Maaten, Postma, and Van den Herik 2009) and (Li et al. 2018). Using (Van Der Maaten, Postma, and Van den Herik 2009), 11 features contribute approximately 91% of the total variance. Using (Li et al. 2018), we sorted the features by descending variance, trained the same RNN one feature at a time, and noticed the plateau began with 9 features at approximately 94% performance accuracy. Overall, from these simple tests, we can see that the intrinsic dimension for the MHealth dataset is approximately between [9, 11], likely closer to 9 due to the complexity of the model and the looser bounds presented by the (Van Der Maaten, Postma, and Van den Herik 2009) heuristic. Since the same neural network architecture and parameters are used for all transformation techniques, their contribution to the intrinsic dimensionality is out of scope for this evaluation. However, ML practitioners can incorporate the technique into their pipelines with ease for future parameter configurations.

Takeaway 3.2: Observing the objective landscape of data is one simple, flexible, and accurate way to identify the intrinsic dimension for consideration along with any data transformations.
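A minimal sketch of the two heuristics used above: the variance-coverage count and a plateau search over features sorted by descending variance. `train_and_score` is a hypothetical helper that trains the same RNN on a feature subset and returns test accuracy, and the plateau tolerance is illustrative.

```python
# Minimal sketch (not the authors' code) of two intrinsic-dimension estimates:
# (1) the number of features covering ~90% of the total variance, and
# (2) the feature count at which accuracy plateaus when features are added one at a
#     time in order of descending variance.
import numpy as np

def variance_intrinsic_dim(X, coverage=0.90):
    variances = np.sort(X.var(axis=0))[::-1]
    cumulative = np.cumsum(variances) / variances.sum()
    return int(np.searchsorted(cumulative, coverage)) + 1

def plateau_intrinsic_dim(X, y, train_and_score, tolerance=0.005):
    order = np.argsort(X.var(axis=0))[::-1]          # descending variance
    scores = []
    for k in range(1, X.shape[1] + 1):
        scores.append(train_and_score(X[:, order[:k]], y))
        # declare a plateau once adding a feature stops improving accuracy meaningfully
        if k > 1 and scores[-1] - scores[-2] < tolerance:
            return k - 1
    return X.shape[1]
```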
Intrinsic Dimension on Robustness with MHealth
With the dimensionality reduction technique, PCA, we were able to see that the performance was only consistent in the case when the input embedding dimensionality more closely approached the intrinsic dimension. Given the intrinsic dimensionality reached with PCA 50%, the codimension was relatively minimized, resulting in the most restricted number of directions for the adversary to take advantage of.

On the other hand, for the feature selection techniques, the lack of mapping to a lower dimension prevented them from approximating the intrinsic dimension as accurately as PCA, resulting in poor performance while under attack. However, since random forest selection more closely approximates the intrinsic dimension (with 9 selected features), its attack success rate differs from low variance selection by approximately 10%. Also, for the candlesticks, the transformation into the four-tuple strayed the furthest away from the intrinsic dimensionality by reshaping the features. This transformation contributed to fundamental information loss for the dataset while straying away from the intrinsic dimension, resulting in codimension on the higher relative end and one of the most efficient creations of adversarial examples, with a 60.98% decrease in robustness at ϵ = 1.0.

Takeaway 3.3: To avoid introducing additional vulnerabilities in ML pipelines, one must observe and understand the particular dataset's intrinsic characteristics and ensure any transformation does not stray from the intrinsic dimension.

7 Conclusion

For this work, we have provided an example where linear data transformation techniques can change an adversary's ability to create effective adversarial examples. From the conclusions presented in Amsaleg et al. (2020), one could be led to believe a transformation that has reduced complexity and high training/testing accuracy would be inherently more robust. However, their conclusion holds between datasets of different complexities but does not speak to the potential impacts of data transformations. Positive impacts by dimensionality reduction techniques are only presented where the technique embeds the high-dimensional input space into a lower-dimensional structure that approaches the intrinsic dimension of the data. Specifically, PCA overperformed only when the dimensionality approached the intrinsic dimension. Meanwhile, the trend extraction techniques that refrained from sufficiently reaching the intrinsic dimension negatively impacted the attack success and the precision scores, overall making the ML model more vulnerable to adversarial examples. Although we only considered a recurrent neural network with LSTM layers, the MHealth dataset that we used is a realistic, high-dimensional time series dataset that shows an example of the impacts that data transformation can have on an ML model.

Our results conclude that when the dimension approaches the optimal intrinsic dimension, lower codimension and higher manifold coverage result in a lesser need to generalize features and reduce the inherent vulnerability to adversarial examples. However, it is important to note that reaching the intrinsic dimensionality is not enough to guarantee perfect robustness. The inevitability of adversarial examples has recently been theoretically studied, and it is still not possible to know the exact and consistent properties of real-world datasets or the resulting fundamental limits of adversarial training for specific datasets (Shafahi et al. 2019). In other words, the underlying distributions themselves can be complex enough that there may be no guarantee of perfect robustness against adversarial examples. Nonetheless, our work highlights the value of considering potential vulnerabilities introduced to ML pipelines through data transformations and how ML practitioners may utilize the intrinsic dimension to reduce the overall complexity of models, avoid introducing additional vulnerabilities, and create more reliable pipelines.

Lastly, as a future direction, the analysis of data transformations (linear and non-linear) on adversarial examples may benefit a model under a poisoning attack. Such analysis could provide insight into how certain data transformations can extricate adversarial noise to increase model robustness.
References

Aleman, C. S.; Pissinou, N.; Alemany, S.; and Kamhoua, G. A. 2018. Using Candlestick Charting and Dynamic Time Warping for Data Behavior Modeling and Trend Prediction for MWSN in IoT. In 2018 IEEE International Conference on Big Data (Big Data), 2884–2889. IEEE.

Amsaleg, L.; Bailey, J.; Barbe, A.; Erfani, S. M.; Furon, T.; Houle, M. E.; Radovanović, M.; and Nguyen, X. V. 2020. High Intrinsic Dimensionality Facilitates Adversarial Attack: Theoretical Evidence. IEEE Transactions on Information Forensics and Security, 16: 854–865.

Banos, O.; Moral-Munoz, J. A.; Diaz-Reyes, I.; Arroyo-Morales, M.; Damas, M.; Herrera-Viedma, E.; Hong, C. S.; Lee, S.; Pomares, H.; Rojas, I.; et al. 2015. mDurance: a novel mobile health system to support trunk endurance assessment. Sensors, 15(6): 13159–13183.

Bhagoji, A. N.; Cullina, D.; Sitawarin, C.; and Mittal, P. 2018. Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), 1–5. IEEE.

Biggio, B.; and Roli, F. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. Pattern Recognition, 84: 317–331.

Bramer, M.; and Devedic, V. 2004. Artificial Intelligence Applications and Innovations. Springer.

Carlini, N.; Athalye, A.; Papernot, N.; Brendel, W.; Rauber, J.; Tsipras, D.; Goodfellow, I.; Madry, A.; and Kurakin, A. 2019. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705.

Carlini, N.; and Wagner, D. 2017a. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 3–14.

Carlini, N.; and Wagner, D. 2017b. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), 39–57. IEEE.

Cheng, Z.; and Lu, Z. 2018. A novel efficient feature dimensionality reduction method and its application in engineering. Complexity, 2018.

Chmielewski, L.; Janowicz, M.; Kaleta, J.; and Orłowski, A. 2015. Pattern recognition in the Japanese candlesticks. In Soft Computing in Computer and Information Science, 227–234. Springer.

Chollet, F.; et al. 2015. Keras. https://github.com/fchollet/keras.

Dash, M.; and Liu, H. 1997. Feature selection for classification. Intelligent Data Analysis, 1(3): 131–156.

Elsayed, G. F.; Shankar, S.; Cheung, B.; Papernot, N.; Kurakin, A.; Goodfellow, I.; and Sohl-Dickstein, J. 2018. Adversarial examples that fool both human and computer vision. arXiv preprint arXiv:1802.08195.

Facco, E.; d'Errico, M.; Rodriguez, A.; and Laio, A. 2017. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1): 1–8.

Feinman, R.; Curtin, R. R.; Shintre, S.; and Gardner, A. B. 2017. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410.

Golay, J.; and Kanevski, M. 2017. Unsupervised feature selection based on the Morisita estimator of intrinsic dimension. Knowledge-Based Systems, 135: 125–134.

Goodfellow, I.; McDaniel, P.; and Papernot, N. 2018. Making machine learning robust against adversarial inputs. Communications of the ACM, 61(7).

Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

Hendrycks, D.; Carlini, N.; Schulman, J.; and Steinhardt, J. 2021. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916.

Hendrycks, D.; Lee, K.; and Mazeika, M. 2019. Using pre-training can improve model robustness and uncertainty. International Conference on Machine Learning.

Huang, J.; and Zhou, W. 2019. Re2EMA: Regularized and Reinitialized Exponential Moving Average for Target Model Update in Object Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 8457–8464.

Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; and Madry, A. 2019. Adversarial examples are not bugs, they are features. Advances in Neural Information Processing Systems 32.

Khoury, M.; and Hadfield-Menell, D. 2018. On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525.

Klinker, F. 2011. Exponential moving average versus moving exponential average. Mathematische Semesterberichte, 58(1): 97–107.

Krizhevsky, A. 2009. Learning multiple layers of features from tiny images. Technical report.

LeCun, Y.; Cortes, C.; and Burges, C. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2.

Li, C.; Farkhoor, H.; Liu, R.; and Yosinski, J. 2018. Measuring the intrinsic dimension of objective landscapes. International Conference on Learning Representations.

Naranjo, R.; and Santos, M. 2019. A fuzzy decision system for money investment in stock markets based on fuzzy candlesticks pattern recognition. Expert Systems with Applications, 133: 34–48.

Nicolae, M.-I.; Sinn, M.; Tran, M. N.; Buesser, B.; Rawat, A.; Wistuba, M.; Zantedeschi, V.; Baracaldo, N.; Chen, B.; Ludwig, H.; Molloy, I.; and Edwards, B. 2018. Adversarial Robustness Toolbox v1.2.0. CoRR, 1807.01069.

Shafahi, A.; Huang, W. R.; Studer, C.; Feizi, S.; and Goldstein, T. 2019. Are adversarial examples inevitable? In International Conference on Learning Representations.

Shlens, J. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.

Su, J.; Vargas, D. V.; and Sakurai, K. 2019. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.

Tjeng, V.; Xiao, K.; and Tedrake, R. 2019. Evaluating robustness of neural networks with mixed integer programming. International Conference on Learning Representations.

Van Der Maaten, L.; Postma, E.; and Van den Herik, J. 2009. Dimensionality reduction: a comparative review. J Mach Learn Res, 10(66-71): 13.

Wu, X.; Jang, U.; Chen, L.; and Jha, S. 2017. Manifold assumption and defenses against adversarial perturbations. arXiv preprint arXiv:1711.08001.

Xu, W.; Evans, D.; and Qi, Y. 2018. Feature squeezing: Detecting adversarial examples in deep neural networks. Network and Distributed Systems Security Symposium.