Sensitive Data Comparison Algorithm Based on Spatio-temporal Label Distribution Fusion

Pengfei Yu

Congcong Shi

State Grid Smart Grid Research Institute Co., Ltd., State Grid Key Laboratory of Information & Network Security, Nanjing, China

While the Internet of Things (IoT) and big data are deeply reshaping every industry, they also pose more severe challenges to data security. Accurate identification of sensitive data is a prerequisite for data security. Compared with traditional machine learning algorithms, deep learning algorithms offer greater capability and flexibility in large-scale data processing. However, existing deep learning-based sensitive data identification methods focus on mining a single content feature and ignore contextual information, so their accuracy is low for sensitive data whose content features are not significant. Therefore, this paper proposes a sensitive data comparison algorithm based on spatio-temporal label distribution fusion. The algorithm simultaneously models the spatial and temporal patterns of the data flow, mines the spatial and temporal labels, and identifies the type of data through a comprehensive judgment strategy, which addresses the problem of identifying sensitive data with insignificant content features. Finally, the algorithm is evaluated in independently repeated experiments on multiple data sets and compared with multiple algorithms. The results show that the Best F-score and NAB score of this model, 0.812 and 69.2 respectively, are significantly better than those of the other algorithms. The proposed algorithm can therefore identify sensitive data more accurately.

Keywords: label distribution, sensitive data, time stamp, space label

1. Introduction

Label ambiguity is a hot research direction in the field of machine learning. In the existing machine learning paradigm, there are mainly two ways of labeling data: (1) assigning a single label to an example; (2) assigning multiple labels to an example. Single-Label Learning (SLL) assumes that all the examples in the training set are labeled in the first way, while Multi-Label Learning (MLL) [1] allows the training examples to be labeled in the second way. Multi-label learning can therefore deal with the ambiguity of an example belonging to multiple categories. Both single-label learning and multi-label learning aim to answer one essential question: "Which labels can describe this example?" However, neither of them directly answers how important each label is to the example.

For many problems in the real world, different labels are often of different importance. For example, a natural scene image [2] may be labeled with "sky", "water", "forest" and "clouds", but these labels describe the image to different degrees. In facial emotion analysis [3], a facial expression is often a mixture of several basic emotions (such as happiness, sadness, surprise, anger, disgust and fear), and these basic emotions appear with different intensities in a specific expression, which produces complicated emotions. There are many similar examples: once an example is related to multiple labels at the same time, these labels are generally not equally important to the example, and some take priority over others. For such applications, a natural method is to assign a real number $d_x^y$ to each possible label $y$ of an example $x$, indicating the degree to which $y$ describes $x$. Without loss of generality, suppose $d_x^y \in [0, 1]$, and further suppose that the label set is complete, that is, the labels in the set together describe an example completely, so $\sum_y d_x^y = 1$. A $d_x^y$ that meets these two conditions is called the description degree of $y$ to $x$.

For an example $x$, the description degrees of all labels constitute a data structure similar to a probability distribution, so it is called a label distribution, and the process of learning on a data set labeled with label distributions is called Label Distribution Learning (LDL) [4].

Label distribution learning can be described as follows:

Let $X = \mathbb{R}^q$ represent the feature space of the examples, and $Y = \{y_1, y_2, \ldots, y_c\}$ represent the label space. Given a training set $S = \{(x_1, d_1), (x_2, d_2), \ldots, (x_n, d_n)\}$, the goal of label distribution learning is to learn a conditional probability mass function $p(y|x)$ from $S$, where $x \in X$ and $y \in Y$.

Assume that the parametric model of $p(y|x)$ is expressed as $p(y|x; \theta)$, where $\theta$ is the parameter vector. Given the training set $S$, the goal of label distribution learning is to find a $\theta$ such that, for each example $x_i$, $p(y|x_i; \theta)$ generates a label distribution as similar as possible to the real label distribution $d_i$ of $x_i$.
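The notion of "as similar as possible" can be made concrete with a distance between distributions. The following is a minimal sketch that assumes the Kullback-Leibler divergence as the similarity measure and a softmax output as the parametric model; neither choice is stated in this paper, and the label distribution and scores are hypothetical values used only for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into a distribution: non-negative and summing to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(d_true, d_pred, eps=1e-12):
    """KL(d_true || d_pred); it is zero when the two distributions coincide."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(d_true, d_pred) if t > 0)

# Hypothetical real label distribution d_i over c = 3 labels.
d_i = [0.6, 0.3, 0.1]

# Hypothetical model outputs p(y | x_i; theta) for two candidate parameter settings.
p_good = softmax([1.8, 1.1, 0.0])
p_bad = softmax([0.0, 0.0, 2.0])

loss_good = kl_divergence(d_i, p_good)
loss_bad = kl_divergence(d_i, p_bad)
print(loss_good < loss_bad)  # the closer prediction has the smaller divergence
```

Under this assumption, learning amounts to choosing the parameter vector that minimizes the divergence summed over all training examples.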

2. Related Technology

2.1 Variational Auto-Encoder

Variational Auto-Encoder (VAE) is a generative model proposed by Kingma and Welling in "Auto-Encoding Variational Bayes" in 2014 [5]. Its network structure is consistent with that of the auto-encoder (AE), consisting of an encoder and a decoder. The encoder encodes raw data into a low-dimensional vector, which we call the latent vector. AE reproduces its input as faithfully as it can, but it cannot generate unseen data because it cannot produce reasonable latent variables at will. To solve this problem, VAE constrains the encoder to produce latent variables that follow a unit Gaussian distribution.

The biggest difference between VAE and AE is that the middle layer of AE outputs specific values of the latent variables, while the middle layer of VAE outputs a distribution over the latent variables. Unlike the AE encoder, which produces a single real-valued vector, the VAE encoder produces two vectors: one for the mean and one for the standard deviation. In this way, the model can draw additional samples from this distribution and feed them into the decoder.
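The sampling step can be sketched as follows. This is an illustrative example only: the mean and standard deviation vectors are hypothetical stand-ins for encoder outputs, and the helper name `sample_latent` is not taken from this paper.

```python
import random

def sample_latent(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), element by element."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

rng = random.Random(0)   # fixed seed so that the sketch is repeatable
mean = [0.5, -1.0]       # hypothetical encoder output: mean vector
std = [0.1, 0.2]         # hypothetical encoder output: standard deviation vector

# Several latent vectors can be drawn from the same encoded distribution;
# each of them would then be passed to the decoder.
samples = [sample_latent(mean, std, rng) for _ in range(5)]
for z in samples:
    print(z)
```

Writing the sample as the mean plus scaled noise keeps the randomness separate from the encoder outputs, which is the usual way such a sampling step is made trainable.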

It should be noted that the error of the model is no longer only the reconstruction error. VAE needs to balance the accuracy of the reconstructed data against the fit to the unit Gaussian distribution, so the loss function is the sum of two terms: on the one hand, as in AE, the output is compared with the input, which gives the reconstruction error; on the other hand, the fit between the latent distribution and the unit Gaussian distribution, which is generally measured by the Kullback-Leibler divergence (KLD).
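A minimal sketch of such a two-term loss is given below. It assumes a mean squared error for the reconstruction term and an unweighted sum of the two terms, neither of which is specified in this paper; the KLD term uses the standard closed form for a diagonal Gaussian against the unit Gaussian.

```python
import math

def reconstruction_error(x, x_hat):
    """Mean squared error between input and reconstruction (an assumed choice)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def kl_to_unit_gaussian(mean, std):
    """Closed-form KL( N(mean, std^2) || N(0, 1) ), summed over latent dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mean, std))

def vae_loss(x, x_hat, mean, std):
    """Sum of the reconstruction term and the KLD term."""
    return reconstruction_error(x, x_hat) + kl_to_unit_gaussian(mean, std)

# Hypothetical values: an input, its reconstruction, and the encoder outputs.
x, x_hat = [0.2, 0.8, 0.5], [0.25, 0.7, 0.5]
mean, std = [0.3, -0.1], [0.9, 1.1]
print(vae_loss(x, x_hat, mean, std))
```

The KLD term vanishes exactly when every mean is 0 and every standard deviation is 1, so it penalizes latent distributions that drift away from the unit Gaussian.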

The Gaussian constraint that VAE places on the latent variable not only enables it to generate random latent variables, but also greatly improves the ability of the network to generate images. For example, assuming that each real number in the interval [0, 10] corresponds to an object name, the interval can represent an infinite number of object names: 7.01 may correspond to apple and 7.02 to banana. When the value 7.01 is received, it is taken to represent apple. However, real-world data contains a certain amount of Gaussian noise, so when the received value is 7.01, the original value may be any number in [6.5, 7.5], such as 7.02 (banana). Therefore, the greater the variance of the data, the less usable information the mean vector carries. Similarly, in VAE, the more efficient the encoding, the closer the standard deviation vector is to the unit standard deviation of the standard Gaussian distribution. This constraint forces the encoder to be more efficient and to generate informative latent variables, which in turn improves the quality of the generated images.

2.2 Hierarchical Temporal Memory

Hierarchical Temporal Memory (HTM), also known as cortical learning, is a new generation of artificial intelligence algorithms published by Numenta, which has also released a corresponding Python platform and a visual recognition software toolbox. HTM originated from the memory-prediction framework proposed by Jeff Hawkins in his book "On Intelligence". The framework has a bionic hierarchical structure that models its input by memorizing patterns and sequences, with information passed both up and down between levels. HTM is designed to simulate how the neocortex works, turning complex problems into pattern matching and prediction. True to its name, the algorithm differs from ordinary neural network algorithms in many ways, and it emphasizes the layering of "neurons". Hierarchy, invariant representations of the spatial and temporal patterns of information, and sequence memory are the three core points of HTM.

The fundamental difference between HTM and neural network algorithms is like the difference between general circuits and gate circuits. Connecting the simulated "neurons" according to the structure of the neocortex produces a completely different effect from that of a general neural network. A general neural network focuses on feedforward connections, while the HTM algorithm pays more attention to the two-way communication of information. This is consistent with the neuroanatomical finding that feedback synapses are no fewer than feedforward synapses, even though feedback has received little attention.

In addition, most traditional artificial intelligence algorithms are designed for specific tasks, while the HTM algorithm first transforms a problem into a pattern matching and prediction problem and then solves it, which makes a "unified theory" of artificial intelligence possible. HTM algorithms are grounded in a large body of anatomy and neuroscience. HTM holds that the neocortex is a necessary condition for human intelligence and is responsible for high-level brain activities. Our brains work by matching the patterns we receive against those in memory, predicting the information we will receive in the next moment and reacting to it, and so on. This is the temporal aspect of the model.

3. Sensitive Data Comparison Model

3.1 Model Overview

[Fig. 2. Decision strategy. Horizontal axis: $P_r$, with threshold η; vertical axis: $1 - L_t$; regions: Type I, Type II, Pending I, Pending II.]

3.2 Label Distribution Decision Strategy

The decision strategy is shown in Fig. 2. The abscissa is the reconstruction probability $P_r$ and the ordinate is $1 - L_t$; both range from 0 to 1. When the evaluation results of the two algorithms are consistent, that shared result is the identification result: data judged as Type II by both algorithms are classified as Type II sensitive data (red area "Type II"), and data judged as Type I by both algorithms are classified as Type I sensitive data (green area "Type I"). When the two algorithms contradict each other, the data enter a pending state and the system makes a further determination:

(1) If HTM determines that the temporal label distribution is Type I but VAE determines that the spatial label distribution is Type II, the point enters the pending area in the upper left corner (purple area "Pending I"). The system reduces the sliding window W and observes $P_r$. If $P_r$ keeps increasing until the point enters the Type I area, the data are determined to be Type I sensitive data. Conversely, if $P_r$ always remains below η, the data are determined to be Type II sensitive data.

(2) If HTM determines that the temporal label distribution is Type II but VAE determines that the spatial label distribution is Type I, the point enters the pending area in the lower right corner (yellow area "Pending II"). The system increases the sliding window W and observes $1 - L_t$. If $1 - L_t$ keeps increasing until the point enters the Type I area, the data are determined to be Type I sensitive data. Conversely, if $1 - L_t$ always remains below γ, the data are determined to be Type II sensitive data.
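The strategy above can be sketched as a small decision function. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the threshold values 0.5 are placeholders (the paper does not give η or γ), a score equal to its threshold is assumed to count as Type I, and the re-scored values obtained after changing W are passed in as a plain list instead of being recomputed by VAE or HTM.

```python
def classify(pr, one_minus_lt, eta, gamma):
    """Map a point (Pr, 1 - Lt) to one of the four regions of the decision plane."""
    vae_type1 = pr >= eta               # spatial label (VAE) says Type I
    htm_type1 = one_minus_lt >= gamma   # temporal label (HTM) says Type I
    if vae_type1 and htm_type1:
        return "Type I"
    if not vae_type1 and not htm_type1:
        return "Type II"
    # The two algorithms disagree: upper left is Pending I, lower right is Pending II.
    return "Pending I" if htm_type1 else "Pending II"

def resolve_pending(state, rescored, eta, gamma):
    """Second-round judgment for a pending point.

    rescored holds the values of Pr (for Pending I) or 1 - Lt (for Pending II)
    observed while the sliding window W is reduced or increased, respectively.
    """
    threshold = eta if state == "Pending I" else gamma
    return "Type I" if any(v >= threshold for v in rescored) else "Type II"

ETA, GAMMA = 0.5, 0.5  # placeholder thresholds, not values from the paper

state = classify(pr=0.3, one_minus_lt=0.8, eta=ETA, gamma=GAMMA)
print(state)                                                  # Pending I
print(resolve_pending(state, [0.35, 0.45, 0.55], ETA, GAMMA))  # Type I
print(resolve_pending(state, [0.30, 0.32, 0.31], ETA, GAMMA))  # Type II
```

Keeping the first-round classification separate from the second-round resolution mirrors the two rounds of judgment: only points on which the two detectors disagree pay the cost of re-scoring with a different window.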

The significance of the above operation is as follows. For a point in Pending I, the system knows its temporal label distribution but not its spatial label distribution. After the window is reduced, if VAE accepts that the point belongs to Type I within this smaller range, the point is judged as Type I sensitive data, which avoids the false alarm caused by VAE's inability to respond to concept drift in time. For a point in Pending II, the system knows its spatial label distribution but not its temporal label distribution. After the window is increased, if HTM learns the temporal pattern over a wider range, the prediction error of the point decreases and the corresponding label distribution falls within the threshold range, so the point becomes Type I sensitive data. This avoids the false alarm caused by HTM's inability to learn the complete distribution pattern when its window is too small. At the same time, fusing the spatial and temporal label distributions lets the system answer both "is the value of the sample point right?" and "should the sample point arrive at this time?", which greatly reduces the false alarm rate and makes the model more fine-grained and multi-dimensional.

4. Experiments and Results

To test the sensitive data comparison, this paper selects 12 typical IoT data sets (three data sets for each type) from an open-source real-world data collection [5, 6], covering CPU utilization, temperature sensors of an intelligent industrial system, electrocardiogram (ECG) and HTTP service response delay.

Fig. 3 shows the comprehensive detection results of the sensitive data comparison scheme based on spatio-temporal label distribution fusion. Fig. 3 (a), 3 (b), 3 (c) and 3 (d) correspond to data sets A, B, C and D, respectively. A is the CPU utilization data set, B is the temperature sensor data set, C is the ECG data set, and D is the HTTP service response delay data set.

[Fig. 3. Detection results: (a) CPU usage, (b) Temperature, (c) ECG, (d) HTTP service delay.]

The proportion of sensitive data labels in the original data sets is 10%. In the figure, the abscissa is the time axis, the blue line is the original data, the red line is the spatial label result obtained from the variational auto-encoder, and the green line is the temporal label result obtained from hierarchical temporal memory. It can be seen from the figure that VAE stably detects the spatial labels of obvious abnormal changes, such as sudden peaks and valleys. HTM, in turn, is sensitive to subtle changes and successfully detects the corresponding temporal labels.

Table 1 shows the model scores obtained by the various algorithms under the NAB scoring mechanism. The score of the "null" detector is 0, the score of the "perfect" detector is 100, and the score of the "random" detector is the average over a series of random seeds. The algorithms involved in the comparison are: the spatio-temporal sensitive data comparison scheme ST (Spatio-Temporal), HTM only, VAE only, LSTM-AE [7-9], an auto-encoder that uses long short-term memory (LSTM), and LSTM-VAE [10-12], a VAE that uses LSTM. Standard Score is the score obtained by the standard NAB calculation ($A_{FP} = A_{FN}$) after all data sets are tested. Reward Low FP is the score obtained with the FP-preferring NAB profile ($A_{FP} > A_{FN}$) on data sets D1 to D3. Reward Low FN is the score obtained with the FN-preferring NAB profile ($A_{FP} < A_{FN}$) on data sets C1 to C3.
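The 0 and 100 anchors follow from how NAB rescales a detector's raw score between the raw scores of the null and perfect detectors. A minimal sketch of that normalization is shown below; the raw scores are hypothetical numbers chosen for illustration and are not results from this paper.

```python
def normalize_nab(raw, raw_null, raw_perfect):
    """Rescale a raw NAB score so that the null detector maps to 0 and the perfect one to 100."""
    return 100.0 * (raw - raw_null) / (raw_perfect - raw_null)

# Hypothetical raw scores under one application profile.
raw_null, raw_perfect = -5.0, 10.0

print(normalize_nab(raw_null, raw_null, raw_perfect))     # 0.0
print(normalize_nab(raw_perfect, raw_null, raw_perfect))  # 100.0
print(normalize_nab(2.5, raw_null, raw_perfect))          # 50.0
```

Because the null and perfect raw scores depend on the profile weights, the same detector can receive different normalized scores under the Standard, Reward Low FP and Reward Low FN profiles.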

Overall, ST, the spatio-temporal fusion detection scheme proposed in this paper, has the highest score. LSTM-AE, the auto-encoder with LSTM, ranks second, followed by the spatial detection algorithm that uses only VAE. HTM only and LSTM-VAE have the lowest scores. The results show that the spatio-temporal fusion detection model proposed in this paper performs well in anomaly detection tasks, and they also support the effectiveness of using LSTM as encoder and decoder to fit time series.

5. Conclusions

To solve the problem that existing machine learning algorithms have low recognition accuracy for sensitive data with insignificant features, this paper studies sensitive data label generation and sensitive data label comparison, proposes a sensitive data comparison model based on spatio-temporal label distribution fusion, and formulates a comprehensive label distribution judgment strategy for this model, which divides the sample data into Type I sensitive data, Type II sensitive data and pending data. Sample points in the pending areas are adjusted dynamically by changing the window size and go through a second round of judgment to obtain the final detection result. Experimental results show that the Best F-score and NAB score of this model, 0.812 and 69.2 respectively, are clearly better than those of the other algorithms.

However, the experimental data classified in this paper are only part of the training set of the original data set. As a next step, the classification can be tested on the complete data set, which is important for investigating the generalization ability and the multi-class capability of the model (up to 39 abnormal classes and 1 normal class, 40 in total).

6. Acknowledgment

This paper is supported by the science and technology project of State Grid Corporation of China: "Research and Application of Scenario-Driven Data Dynamic Authorization and Compliance Control Key Technology" (Grant No. 5700-202058481A-0-0-00).

7. References

[1] Yang, Song, Gao, Hu and Xiao, "Griffin: Real-Time Network Intrusion Detection System via Ensemble of Autoencoder in SDN," IEEE Transactions on Network and Service Management, vol. 19, no. 3, pp. 2269-2281, Sept. 2022, doi: 10.1109/TNSM.2022.3175710.

[2] Chen, Song, Jennings, Zhang, Xiao and Gao, "IoT-ID: Robust IoT Device Identification Based on Feature Drift Adaptation," 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-6, doi: 10.1109/GLOBECOM46510.2021.9685693.

[3] Song, Chen, T. Wu, Zheng, Chen and Wang, "Enhancing Packet-Level Wi-Fi Device Authentication Protocol Leveraging Channel State Information," Wireless Communications and Mobile Computing, vol. 2021.

[4] Song, Geng, Wang, Gao and Shi, "Permission Sensitivity-Based Malicious Application Detection for Android," Security and Communication Networks, vol. 2021.

[5] Chen, Song, T. Wu, Chen, Wang and Li, "Enhancing Wi-Fi Device Authentication Protocol Leveraging Channel State Information," 2021 International Conference on Mobile Multimedia Communications, Springer, Cham, 2021, pp. 33-46.

[6] Chen, Song, Zhu, Gao, Wang and Hu, "Authenticating Mobile Wireless Device Through Per-packet Channel State Information," 2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2021, pp. 78-84, doi: 10.1109/DSN-W52860.2021.00024.

[7] Wu, Song, Zhang, Gao and Chen, "My Site Knows Where You Are: A Novel Browser Fingerprint to Track User Position," ICC 2021 - IEEE International Conference on Communications, 2021, pp. 1-6, doi: 10.1109/ICC42927.2021.9500556.

[8] Ma, Song, Wang, Gao, Xiao and Hu, "You Can Hear But You Cannot Record: Privacy Protection by Jamming Audio Recording," ICC 2021 - IEEE International Conference on Communications, 2021, pp. 1-6, doi: 10.1109/ICC42927.2021.9500456.

[9] Song, Gao, Xiao and Hu, "I Know What You Type: Leaking User Privacy via Novel Frequency-Based Side-Channel Attacks," 2018 IEEE Global Communications Conference (GLOBECOM), 2018, pp. 1-6, doi: 10.1109/GLOCOM.2018.8647385.

[10] Song, Dong, Hu and Gao, "WebLogger: Stealing your personal PINs via mobile web application," 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), 2017, pp. 1-6, doi: 10.1109/WCSP.2017.8171036.

[11] C. Shi, R. Song, X. Qi, Y. Song, B. Xiao and S. Lu, "ClickGuard: Exposing Hidden Click Fraud via Mobile Sensor Side-channel Analysis," ICC 2020 - 2020 IEEE International Conference on Communications (ICC), 2020, pp.

[12] Y. Song, S. Gao, A. Hu and B. Xiao, "Novel attacks in OSPF networks to poison routing table," 2017 IEEE International Conference on Communications (ICC), 2017, pp. 1-6.