<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Dictionary Learning for data compression within a Digital Twin Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Cavalli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domitilla Brandoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Margherita Porcelli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Pascolo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CINECA</institution>
          ,
          <addr-line>Via Magnanelli 2, Casalecchio di Reno (BO), 40033</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Ingegneria Industriale, Università degli Studi di Firenze</institution>
          ,
          <addr-line>Viale Morgagni 40/44, 50134, Firenze</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ISTI-CNR</institution>
          ,
          <addr-line>Via Moruzzi 1, Pisa</addr-line>
          ,
          <country country="IT">Italy.</country>
          <institution>INdAM Research Group GNCS</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Digital Twin systems play a crucial role in several contexts, from smart agriculture to predictive maintenance, from healthcare to weather modelling. To be effective, they require a continuous exchange of massive data between IoT sensors in the real world and a digital system hosted on HPC infrastructure, and vice versa. Nevertheless, the transmitted signals often exhibit high similarity, resulting in a redundant dataset very suitable for compression. This paper shows how Dictionary Learning can be used as a preprocessing technique for AI algorithms due to its ability to compress large data volumes by up to 80%, with a potential enhancement of performance by acting as both a denoising and a compression technique. The algorithm operates efficiently on various types of datasets, from images to time series, and is well-suited for deployment on devices with limited computational resources, like IoT sensors.</p>
      </abstract>
      <kwd-group>
<kwd>Digital Twin</kwd>
        <kwd>Dictionary Learning</kwd>
        <kwd>parallel OMP</kwd>
<kwd>time series compression</kwd>
        <kwd>image compression</kwd>
        <kwd>anomaly detection</kwd>
        <kwd>image recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
A digital twin can be simply seen as a system consisting of two entities: a tangible subject of interest and its digital replica, interconnected by a continuous stream of data. In this context, data reflecting the physical entity are acquired through IoT sensors and sent to a dedicated HPC system which constitutes its digital mirror. Within the HPC system, the data undergo AI analysis to simulate the behavior and potential scenarios of the physical entity. The resulting insights are looped back into the physical system, impacting decision-making. Efficient transmission and storage of such large volumes of sensor data are therefore crucial to reduce latency between the two systems and ensure a reliable real-time digital representation, but this is often prohibitively expensive. For this reason, it is necessary to explore compression algorithms that lighten and speed up data transmission while preserving the meaningful information. Among the available state-of-the-art compression tools, we explore Dictionary Learning (DL), a robust sparse matrix factorization approach. Given a matrix of signals Y, DL is able to learn a sparse representation Y ≈ DX expressing each signal as a linear combination of a few basis elements, called atoms, which constitute the columns of D. In this work we will show that DL has various features that make it very suitable for use in data compression and transmission: i) it enables exceptional compression of redundant data due to its distinctive sparse factorization; ii) it is a versatile approach, able to handle diverse data types including images and time series; iii) its solution can be computed with an algorithm, supplied in this work, with low computational resource demand and independent of specific libraries, making it lightweight and well-suited for edge computing.</p>
      <p>
        The literature on DL comprises many applications across various fields, including denoising, inpainting, classification, and compression. Regarding data compression, an interesting online DL approach is proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] where massive datasets streamed through in a preset order are compressed and denoised. Furthermore, the work [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ] presents CORAD, a novel DL-based compression algorithm for time series which is able to harness the correlation across multiple related time series to eliminate redundancy, performing a more efficient compression. However, as far as we know, this work is the first to incorporate DL as a compression method within the Digital Twin (DT) domain, using it as a powerful preprocessing technique for both time series and images. Also, we developed an optimized DL algorithm to increase its lightness and efficiency in the DT framework.</p>
      <p>
        This work is structured as follows: Section 2 gives a brief overview of the DL problem and of its solution. Section 3 integrates the DL approach within a DT framework and presents the overall DL4DT workflow, while Section 4 discusses numerical results, conducting a detailed analysis of the algorithm performance across various datasets. Additionally, it introduces several techniques designed to improve the algorithm execution speed. All the code necessary to reproduce the experiments shown in this paper is available at the following link: https://github.com/Eurocc-Italy/DL4DT.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dictionary Learning overview</title>
      <p>
        The aim of DL is to discover an overcomplete set of basis functions (atoms) able to represent a given set of data samples in a sparse manner. Given a matrix of training signals Y ∈ R^{m×N} (m ≪ N), DL seeks a dictionary D ∈ R^{m×n} (n ≪ N) and a sparse matrix X ∈ R^{n×N} such that Y ≈ DX. The DL problem can be formulated in many equivalent ways, each one promoting a different aspect of the problem, as shown in detail in [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ]. In this case we decided to formulate it as a two-variable, non-convex, constrained optimization problem of the form

    min_{D,X} ‖Y − DX‖_F²
    s.t. ‖x_ℓ‖₀ ≤ s,  ℓ = 1, …, N,
         ‖d_j‖₂ = 1,  j = 1, …, n,      (1)

where the number of atoms n and the sparsity level s are fixed. Here, ‖·‖₂ and ‖·‖₀ denote the ℓ₂ and ℓ₀ norms of a vector, respectively, and ‖·‖_F is the Frobenius norm.</p>
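      <p>As an aside, a factorization of the form (1) can also be computed with off-the-shelf tools. The following is a minimal illustrative sketch (ours, not the DL4DT code) using scikit-learn's DictionaryLearning with arbitrary sizes; note that scikit-learn stores signals as rows, so the roles of rows and columns are transposed with respect to our notation:</p>
      <preformat>
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
Y = rng.standard_normal((200, 64))      # 200 signals with 64 features (signals as rows)

dl = DictionaryLearning(
    n_components=128,                   # number of atoms n (overcomplete: 128 &gt; 64)
    transform_algorithm="omp",          # sparse coding via OMP
    transform_n_nonzero_coefs=5,        # sparsity level s
    max_iter=10,
    random_state=0,
)
X = dl.fit_transform(Y)                 # sparse codes, shape (200, 128)
D = dl.components_                      # dictionary, shape (128, 64), unit-norm atoms
print(np.linalg.norm(Y - X @ D) / np.linalg.norm(Y))   # relative reconstruction error
      </preformat>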
      <p>Problem (1) is NP-hard and admits multiple global optima; therefore convergence to the global minimum is not guaranteed. In order to solve the DL problem, we follow the usual alternating optimization approach. More precisely, given the signal matrix Y and an initial dictionary D, at each iteration first the minimization problem in X is solved while D is fixed (Sparse Coding step), and then the minimization problem in D is solved while keeping X (possibly) fixed (Dictionary Update step).</p>
      <p>The problem to be solved at the Sparse Coding step can be formulated as follows

    min_X ‖Y − DX‖_F²  s.t. ‖x_ℓ‖₀ ≤ s,  ℓ = 1, …, N,      (2)

which can be decomposed into the solution of N problems, i.e. one for each signal:

    min_{x_ℓ} ‖y_ℓ − Dx_ℓ‖₂²  s.t. ‖x_ℓ‖₀ ≤ s.             (3)

For solving each problem (3) we employed Orthogonal Matching Pursuit (OMP), an iterative greedy algorithm that at each step selects the atom best correlated with the residual e := y − Dx, and then produces a new approximation by projecting the signal y onto the dictionary elements that have already been selected (indexed by the set S). We report in Algorithm 1 a naive version of OMP where the least squares solution x_S is computed from scratch at each step (refer to [4] for more details).</p>
      <sec id="sec-2-1">
        <title>Algorithm 1 OMP (naive approach) [4]</title>
        <p>Given y ∈ R^m, the sparsity level s, the dictionary D ∈ R^{m×n} and the stopping tolerance ε &gt; 0
Initialize S = ∅, e = y
while |S| &lt; s and ‖e‖₂ &gt; ε do
    j* = argmax_{j ∉ S} |eᵀd_j|
    S = S ∪ {j*}
    x_S = (D_Sᵀ D_S)⁻¹ D_Sᵀ y
    e = y − D_S x_S
end while</p>
        <p>Since at each step the current matrix D_S is updated by simply appending one column, a more efficient implementation can be obtained by exploiting the least squares solution computed at the previous step. The best-known approaches make use of the Cholesky decomposition of D_Sᵀ D_S [4, sec. 2.2] or of the QR decomposition of D_S [4, sec. 2.3]. Our computational experience showed that the OMP-QR implementation is faster when applied to DL [5]. Therefore, we implemented a parallel version of the OMP-QR code to speed up the computational times.</p>
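        <p>For concreteness, the following is a minimal NumPy sketch (ours, for illustration only) of the naive OMP of Algorithm 1; the function name and the default tolerance are our choices:</p>
        <preformat>
import numpy as np

def omp_naive(y, D, s, eps=1e-6):
    """Sparse-code signal y over dictionary D (unit-norm columns), naive OMP."""
    m, n = D.shape
    support = []                        # S: indices of the selected atoms
    e = y.copy()                        # current residual
    x = np.zeros(n)
    while len(support) &lt; s and np.linalg.norm(e) &gt; eps:
        corr = np.abs(D.T @ e)          # correlation of each atom with the residual
        corr[support] = -np.inf         # exclude atoms already selected
        support.append(int(np.argmax(corr)))
        D_S = D[:, support]
        x_S, *_ = np.linalg.lstsq(D_S, y, rcond=None)   # LS solved from scratch
        e = y - D_S @ x_S               # new residual
    if support:
        x[support] = x_S
    return x
        </preformat>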
        <p>Regarding the Dictionary Update step, the following minimization problem has to be solved

    min_D ‖Y − DX‖_F²  s.t. ‖d_j‖₂ = 1,  j = 1, …, n,      (4)

where the sparsity pattern of X is fixed. For this task we followed the K-SVD approach [6].</p>
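        <p>For reference, a compact sketch (ours) of one K-SVD sweep over the atoms, under the conventions of problem (4); the helper name is our choice:</p>
        <preformat>
import numpy as np

def ksvd_update(Y, D, X):
    """One K-SVD sweep: refresh each atom d_j and the nonzeros of row j of X."""
    for j in range(D.shape[1]):
        omega = np.flatnonzero(X[j, :])     # signals currently using atom j
        if omega.size == 0:
            continue                        # unused atom: leave it untouched
        # residual restricted to those signals, with atom j's contribution removed
        E = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, j], X[j, omega])
        U, sv, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]                   # best rank-1 fit gives the new unit-norm atom
        X[j, omega] = sv[0] * Vt[0, :]      # updated nonzero coefficients
    return D, X
        </preformat>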
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dictionary Learning to reduce latency in Digital Twin</title>
      <sec id="sec-3-1">
        <title>Reducing data latency is one of the main challenges</title>
        <p>within the DT context. This section aims to outline
the proposed workflow, named DL4DT, to decrease data
transmission time using DL as a compression technique.
min ‖ − ‖2 s.t. ‖x‖0 ≤ ,  = 1, . . . , . (2) DL4DT, illustrated in Figure 1, takes place in two stages.</p>
        <p>First of all (Fig. 1, top), the data are collected from the physical device, represented as a matrix Y, and then transmitted to the digital counterpart. Here, the entire process of DL factorization is applied to Y, resulting in the learning of a reliable and robust overcomplete dictionary D and of the sparse representation X. The dictionary D is both saved on the digital system and transmitted back to be saved also on the physical one. Afterwards, a new smaller dataset of signals Y1 is collected (Fig. 1, bottom). Instead of transferring the complete Y1, we claim that computing its sparse representation X1 with OMP, using the reference dictionary D from stage 1, is sufficient. Transmitting X1, which is highly sparse, improves transmission time and reduces costs: solving a single Sparse Coding step demands fewer computational resources than a full DL factorization, and transferring only X1 is lighter than sending the entire Y1. Indeed, suppose that Y1 has N1 signals of m features each. Instead of passing all the m × N1 elements, with our method it is enough to transmit the s × N1 non-zero elements of X1. Notice that in sparse matrices each non-zero element is stored as a triplet (row_index, column_index, non_zero_value), requiring a total storage of s × N1 × 3 values. Therefore, transferring X1 yields a reduction of 1 − 3s/m in transmitted data. Moreover, users have the flexibility to specify under which conditions the dictionary D has to be updated, in order to obtain more reliable results. For example, a reasonable choice can be updating the dictionary after a fixed period of time, or when the accuracy of the AI algorithm on the compressed dataset starts to decrease too much. We refer to these conditions as user_conditions in the forthcoming Algorithm 2. As we will show, DL4DT is very effective, since DL techniques allow massive compression while preserving the most important features of the dataset. DL4DT is summarized in Algorithm 2.</p>
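        <p>The storage argument can be made concrete in a few lines (an illustration with hypothetical sizes, assuming SciPy; the measured figures are reported in Section 4.3):</p>
        <preformat>
import numpy as np
from scipy import sparse

m, n, N1, s = 128, 256, 1000, 10        # hypothetical features, atoms, signals, sparsity

rng = np.random.default_rng(0)
X1 = sparse.random(n, N1, density=s / n, format="coo", random_state=rng)

dense_count = m * N1                    # values shipped if the raw Y1 is sent
triplet_count = 3 * X1.nnz              # (row, col, value) per stored non-zero of X1
print(f"reduction: {1 - triplet_count / dense_count:.0%}")   # about 1 - 3*s/m
        </preformat>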
      </sec>
      <sec id="sec-3-2">
        <title>Algorithm 2 DL4DT: workflow of a DT process with DL</title>
        <p>techniques.</p>
        <p>Collect data on the physical counterpart in matrix  .
Send  to the digital system.</p>
        <p>Compute the dictionary  and the sparse matrix 
with DL factorization of  on the digital system.
 = 0
while True do
if  = 0 then</p>
        <p>Send the dictionary  to the physical system
and store it.
else</p>
      </sec>
      <sec id="sec-3-3">
        <title>Compute  using OMP-QR on the physical</title>
        <p>system.</p>
        <p>Send  to the digital system.
end if
 =  + 1
Compute ˜ =  on the digital system.</p>
        <p>Apply AI algorithm using ˜ as dataset.
if user_conditions then</p>
        <p>break
end if
end while
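        <p>In code form, the digital-twin side of this loop might look as follows (a hypothetical sketch: dl_factorize, channel, run_ai and user_conditions are placeholder callables, not the DL4DT API):</p>
        <preformat>
def dl4dt_digital_side(Y, dl_factorize, channel, run_ai, user_conditions):
    D, X_k = dl_factorize(Y)      # stage 1: full DL factorization on the HPC
    channel.send(D)               # k = 0: ship the dictionary to the device
    while True:
        Y_tilde = D @ X_k         # reconstruct the compressed dataset
        run_ai(Y_tilde)           # apply the AI algorithm on the digital twin
        if user_conditions():     # e.g. periodic refresh or accuracy decline
            break
        X_k = channel.receive()   # k &gt; 0: sparse codes computed on the device
        </preformat>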
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Numerical Results</title>
      <p>In this section, after introducing the datasets, we validate the DL approach as an effective compression tool for addressing DT latency problems. Then, we simulate and analyze the DL4DT workflow presented in Section 3, exploiting the DL ability to build a highly representative dictionary. All experiments were run on Galileo100 [7], an HPC infrastructure owned by CINECA with 528 computing nodes, each with 2 × Intel CascadeLake 8260 CPUs (24 cores each, 2.4 GHz), 384 GB of RAM and an NVIDIA Mellanox Infiniband 100GbE network.</p>
        <sec id="sec-3-3-1">
          <title>4.1. Datasets</title>
          <p>
            We focused on three datasets with various types of data (images or time series) and dimensions: MNIST [8], FordA [9], and a fine-grained time series from the D.A.V.I.D.E. HPC system [
            <xref ref-type="bibr" rid="ref2">10, 11</xref>
            ]. D.A.V.I.D.E. is a supercomputer developed by E4 Computer Engineering [12] and hosted in the past by CINECA, with an integrated monitoring infrastructure called Examon [
            <xref ref-type="bibr" rid="ref2">10</xref>
            ]. In this work we focused on a subset of the data collected by Examon: for each of the 45 nodes, 166 metrics were considered (core workloads, temperatures, fan speeds, power consumption, etc.), collected at 5-minute intervals. In detail, we focused on the 16th node.
          </p>
        </sec>
        <sec id="sec-3-3-2">
          <title>4.2. Dictionary Learning compression</title>
          <p>
            4. Numerical Results To evaluate the efectiveness of our compression, it is
essential to compare the information generated by AI
modIn this section, after introducing the datasets, we vali- els trained on both the original and compressed datasets.
date the DL approach as an efective compression tool This is crucial within the DT framework, where our
prifor addressing DT latency problems. Then, we simulate mary aim is to extract valuable insights from compressed
and analyze the DL4DT workflow presented in Section data.
3, exploiting the DL ability to build a highly representa- We considered a CNN tailored for digit recognition
tive dictionary. All experiments were run on Galileo100 [13] on MNIST dataset, a CNN able to perform anomaly
[7], an HPC infrastructure owned by CINECA with 528 detection suggested in [14] on FordA and an
autoencoderbased model able to automatically detect anomalies in a pression settings. The overall accuracy, approximately
semi-supervised fashion ([
            <xref ref-type="bibr" rid="ref2">10, 11</xref>
            ]) on D.A.V.I.D.E. After 86%, is lower than previous cases as expected due to the
training the NNs described above on both original and real-world nature of the dataset. However we notice that
compressed datasets, we compared their performance the test accuracy reached by training the autoencoder on
on the same test set by studying the accuracy, which the compressed training dataset is almost identical to the
is defined as the ratio of the number of correct predic- one obtained with no compression. However, when
dealtions over the total number of predictions. Figure 2 com- ing with imbalanced datasets, it is better to consider the
pares respectively the test accuracy achieved by the NNs F-score value achieved for each class (normal signals and
trained on the original dataset (green dotted line) and anomalies) rather than the accuracy. F-score value is
deon a DL compression of MNIST (top) and FordA (bot- fined as F-score := 2 ×+ , where 
tom) concerning a sparsity level of  = 50 and a number and  are the ratio of true positives to the total
preof iterations  = 20 (orange solid line) across various dicted positives and to the actual positives, respectively.
compression levels. The results obtained with other set- We notice that the F-score reached on normal signals,
tings of DL are shown in more detail in [5]. As expected,
y
c
a
r
u
c
c
a
y
c
a
r
u
c
c
a
100
90
80
70
          </p>
          <p>40
100
90
80
70
50
60
70</p>
          <p>80
% compression
40
50
60</p>
          <p>70
% compression
no compression
the accuracy computed on the compressed datasets is
lower than the one computed on the original dataset.
Despite not matching exactly the original accuracy, we still
achieve extremely good results: with MNIST dataset we shown in the middle of Fig.3, remains almost unafected
can even reach an accuracy of 97% with a compression of by compression: across various DL configurations, the
80% against an accuracy of 99% with no compression, this F-score consistently remains close to 98%, as the original
is probably due to the redundant nature of the datasets, case without compression. This finding aligns with our
which makes it possible to achieve high accuracy lev- expectations, as the training set in this example consists
els even with high levels of compression. On FordA an only of signals without anomalies. As for the F-score of
overall accuracy of 91% is reached even with high com- anomalies, shown at the bottom of Fig.3, we observe that
pression levels against 96% with no compression. Figure 3 this value increases when compression is more intense.
shows at the top the test accuracy achieved by the autoen- Examining the details of the Recall and Precision values
coder trained on the original D.A.V.I.D.E dataset (green for these cases (Table 1), we notice that, respectively, the
dotted line) and on the dataset compressed with DL with Recall for normal signals and the Precision for anomalies
 = 5 and  = 10 (orange solid line) and diferent com- are higher compared to the case without compression.</p>
        <p>Table 1: Precision and Recall (%) on D.A.V.I.D.E. for normal signals and anomalies, without compression (0%) and with 80% DL compression.
Compression  Class    Precision  Recall
0%           normal   99.8       95.4
80%          normal   99.8       96.3
0%           anomaly  79.8       99.1
80%          anomaly  84.2       99.1</p>
        <p>These two values (the Recall of normal signals and the Precision of anomalies) take into account the cases where certain signals are identified as anomalies even though they are not. The higher the value, the more this type of error is avoided. It is therefore consistent that DL compression can increase these values, as DL is known to be a valuable denoising tool, leading to improved anomaly detection.</p>
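        <p>For clarity, per-class figures like those of Table 1 can be reproduced in a few lines (an illustration with toy labels, assuming scikit-learn; here 0 marks normal signals and 1 anomalies):</p>
        <preformat>
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 0, 1, 1, 0, 1, 0]     # toy ground truth
y_pred = [0, 0, 1, 1, 1, 0, 0, 0]     # toy predictions
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, labels=[0, 1])
print(f"normal:  P={prec[0]:.2f} R={rec[0]:.2f} F={f1[0]:.2f}")
print(f"anomaly: P={prec[1]:.2f} R={rec[1]:.2f} F={f1[1]:.2f}")
        </preformat>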
        <p>Let us now explore some implementation aspects of the code. In our scenario we have to deal with substantial problem dimensions, but we can also benefit from the computational resources of an HPC cluster in the first stage of the workflow presented in Section 3. These resources can be fully employed in the OMP algorithm, which can be parallelized with the Joblib Python library [15], following what was mentioned in Section 2. Figure 4 illustrates the speedup achieved by executing OMP-QR serially and in parallel with an increasing number of processors, where the speedup is the ratio of the execution time of the serial code to the execution time of the parallel code performing the same task.</p>
        <p>[Figure 4: speedup of the OMP-QR algorithm with Joblib parallelization against the ideal speedup, for 2, 4, 8 and 16 tasks. For this type of problem it is not meaningful to increase resources beyond 16 tasks.]</p>
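        <p>A minimal sketch (ours) of this parallelization: the N signals are coded independently, so the per-signal OMP solves are embarrassingly parallel; here we use scikit-learn's OMP routine in place of our OMP-QR kernel:</p>
        <preformat>
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import orthogonal_mp

def sparse_code_parallel(Y, D, s, n_jobs=16):
    """Code each column of Y over D with OMP, one chunk of signals per worker."""
    chunks = np.array_split(np.arange(Y.shape[1]), n_jobs)
    blocks = Parallel(n_jobs=n_jobs)(
        delayed(orthogonal_mp)(D, Y[:, idx], n_nonzero_coefs=s) for idx in chunks
    )
    return np.hstack(blocks)            # X: one s-sparse column per signal
        </preformat>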
        <p>The proposed parallelization has a significant impact on the total computational time of the DL algorithm: when the plain DL algorithm is run sequentially on a single CPU, it requires about 20 hours to complete 20 iterations on a matrix of size 784 × 60,000, while the same algorithm implemented with the Joblib-parallelized version of OMP-QR using 16 CPUs completes the task in about 5 hours. We have also developed a light C version of the code.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Dictionary representativity</title>
        <p>As already mentioned, the data provided by a DT do not usually show great variability. This section aims to verify whether the dictionary learned in the first stage is robust enough to accurately represent newly collected data. If successful, this would make it possible to run the sparse coding step (OMP-QR) without the need for a dictionary update. In particular, we integrate the study of dictionary representativity into a simulation of the DL4DT workflow on the D.A.V.I.D.E. dataset, keeping track of the original sizes, compression levels, and times.</p>
        <p>The goal of the first stage is to learn a reliable and representative dictionary. Thus, we begin by considering the 4432 signals of its training set. In our workflow these data are sent to the digital twin, where we choose to apply the strongest yet most meaningful compression, i.e. a compression of 80% with s = 20, n = 349 and 10 iterations. From previous studies we know that such a compression can reach an overall F-score of about 97.9% on normal signals and 90.7% on anomalies, taking around 3 minutes. The dictionary is then stored on the digital twin and sent back to the physical one.</p>
        <p>After a fixed time interval, a new matrix of signals Y1 is collected on the physical system. We simulate this new matrix of signals by taking the test set relative to the 16th node, since it is completely new to the dictionary and presents anomalies. We then compute its sparse representation matrix X1 with a single run of OMP-QR with s = 15, taking around 3 seconds. The sparse representation matrix is then sent to the digital system, where it is used to reconstruct the signals as Ŷ1 = DX1. To evaluate the information loss due to the data compression, we consider the autoencoder trained in the first run on the compressed training set and check whether it is still able to detect the same anomalies when tested on the compressed test set Ŷ1. We obtain extremely good results, achieving an F-score of 97% on normal samples and 89.9% on anomalies. These outcomes are very close to the results obtained without compression, which were respectively 97.9% and 90.7%. The DL setting that we chose is indeed a sensible one: increasing the compression level helps to smooth the signals, with beneficial results, yet the dictionary remains highly representative with the sparsity level set to s = 20. We conducted a similar experiment using random compression instead of DL, retaining only 30% of the samples chosen randomly from the test set, and obtained an F-score of 98% on normal samples and 63% on anomalies, which is definitely worse. Thanks to this workflow, instead of transmitting the entire signal matrix Y1 of dimensions 165 × 3074, it is enough to compute and transfer its sparse representation X1, which requires the storage of 15 × 3074 × 3 elements. This results in a memory gain of 73% (consistent with the 1 − 3s/m estimate of Section 3), requiring only 3 seconds and causing a minimal loss of information.</p>
        <p>This process can be iterated multiple times, until the dictionary D requires updating to ensure more accurate outcomes. For instance, the dictionary might be refreshed periodically, or whenever the performance of the AI algorithm on the compressed dataset begins to decline significantly. The results confirm that the dictionary D learned on the training set manages to represent new signals quite effectively. Indeed, the accuracy levels achieved by the signals reconstructed with the old dictionary D are good, allowing a significant gain in computational efficiency.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Archibald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <article-title>A dictionary learning algorithm for compression and reconstruction of streaming data in preset order</article-title>
          ,
          <source>Discrete and Continuous Dynamical Systems - Series S</source>
          <volume>15</volume>
          (
          <year>2021</year>
          ). doi:10.3934/dcdss.2021102.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>doi:10</source>
          .3934/dcdss.2021102.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Khelifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khayati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          ,
          <article-title>Corad: Correlation-aware compression of massive time series using sparse dictionary coding</article-title>
          ,
          <source>in: 2019 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2289</fpage>
          -
          <lpage>2298</lpage>
          . doi:10.1109/BigData47090.2019.9005580.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dumitrescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Irofti</surname>
          </string-name>
          ,
          <source>Dictionary Learning Algorithms and Applications</source>
          , Springer Cham,
          <year>2018</year>
          . doi:10.1007/978-3-319-78674-2.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>