Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and Education" (GRID'2021), Dubna, Russia, July 5-9, 2021

RISK MODEL OF APPLICATION OF LIFTING METHODS

A. Bogdanov1, A. Degtyarev1,3, N. Shchegoleva1, V. Khvatov4, N. Zaynalov2, J. Kiyamov1, A. Dik1,a, A. Faradzhov1

1 Saint Petersburg State University, 7-9 Universitetskaya emb., Saint Petersburg, 199034, Russia
2 Samarkand branch of Tashkent University of Information Technology, Uzbekistan
3 Plekhanov Russian University of Economics, 36 Stremyanny lane, Moscow, 117997, Russia
4 DGT Technologies AG, http://dgt.world/

E-mail: a st087383@student.spbu.ru

The article discusses the main provisions (methods, risk models, calculation algorithms, etc.) of organizing the protection of personal data (PD) on the basis of an anonymization procedure. The authors show the relevance of the problem, which stems from the general growth of informatization and the further development of Big Data technology. This circumstance makes it necessary to use the so-called risk approach, in which the PD risk is calculated as a probabilistic assessment of the possible damage that the owner of a data resource may incur as a result of a successfully carried out information attack. To this end, the article describes an algorithm for calculating the PD risk and proposes a risk model of the depersonalization procedure that covers confidentiality problems arising both from unauthorized access and from planned data processing. To describe the risk model of the anonymization procedure, the types of attacks on the confidentiality of personal data, anonymization metrics and equivalence classes are analyzed, as well as attacker profiles and data distribution scenarios. On this basis, the choice of a risk model for the depersonalization procedure is justified, and calculations for a generated synthetic PD set are presented. In conclusion, the proposed model of anonymization risk assessment, tested on synthetic data, makes it possible to abandon the concept of guaranteed anonymized data, introducing explicit boundaries for working with risks and building a continuous process for assessing PD threats that takes into account the constantly growing volume of stored and processed information.

Keywords: information protection, personal data, depersonalization, information systems, model, risk of depersonalization procedure.

1. Introduction

Recently, the problem of protecting personal data (PD) has become increasingly urgent. In this context, the question of how to apply various depersonalization methods, and how to build the corresponding data risk model (risk model), is raised more and more often.
The relevance of this topic stems from the ever deeper penetration of information technologies into our lives: we are becoming more and more dependent on information systems and services and, consequently, more and more vulnerable to security threats. Information systems that process personal data are particularly exposed to this risk. It is enough to recall the growth of unauthorized dissemination of personal data and its consequences in recent years: the theft of information about subscribers of mobile operators and other communication services, the trading of information about the customers of banks, insurance companies, etc. In view of these circumstances, it is advisable to consider various depersonalization methods as promising and potential ways to protect personal data. Depersonalization is the part of personal data processing aimed at deleting identifying personal information; as a result of this process, a new, depersonalized and secure data set is formed from the initial array of personal data.

2. General description of the risk model

Currently, the international practice of using depersonalization methods is shifting towards a risk approach. In this approach, the risk assessment is carried out in order to develop measures that ensure the confidentiality of private information when depersonalized data must be published. The emergence of new sources of information makes it possible to link data with previously published data, which inevitably creates risks of re-identification. This, in turn, forces us to abandon the concept of guaranteed anonymized data, introducing explicit boundaries for working with risk (a risk threshold) and building a continuous process of assessing threats to personal data.

As part of the standard approach, the risk is assessed on the basis of identified threats (associated with the intruder's profile) and existing vulnerabilities. At the same time, it should be taken into account that external and internal factors have a significant impact on the risk assessment: the availability of additional information, the motivation of the attacker, the legal framework, the IT systems used, management practices, etc. This leads to a division of the overall risk into the risks of the data itself (taking into account the depersonalization methods used) and the risks of the environment (contextual risks). Threats to the confidentiality of personal data arise as a result of authorized data processing, as well as a result of unauthorized access or the actions of an attacker.

3. Risk model

Taking the above factors into account, the study suggests building a risk model based on the combined use of methods for assessing data risks and contextual risks (a minimal sketch of such a combined assessment is given after this list). To carry out the risk assessment procedure, it is necessary to build a risk model that determines the risk factors and the relationships between them, based on the following sequential steps:
- risk factorization (identifying a set of individual risk components and establishing links between them);
- formation of a data release model;
- setting quantitative risk thresholds;
- determination of the required level of usefulness of the resulting depersonalized data;
- justification of the procedure for constructing a risk model for a specific depersonalization procedure, including the possibility of re-evaluating the risk when different depersonalization methods are used.
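To make the combined assessment concrete, the sketch below shows one possible way of putting data risk and contextual risk together. The paper itself gives no formulas here, so the product rule, the threshold check and all names are our illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass

@dataclass
class RiskModel:
    """Hypothetical combined risk model: overall re-identification risk
    is taken here as the product of data risk and contextual risk
    (an assumption for illustration; the paper gives no formula)."""
    data_risk: float     # probability of re-identification from the data itself
    context_risk: float  # probability that the environment exposes the data
    threshold: float     # acceptable risk threshold for the release scenario

    def overall_risk(self) -> float:
        return self.data_risk * self.context_risk

    def release_allowed(self) -> bool:
        return self.overall_risk() <= self.threshold

# Example: a public-release scenario with a strict threshold
model = RiskModel(data_risk=0.2, context_risk=0.3, threshold=0.05)
print(model.overall_risk(), model.release_allowed())  # ~0.06 False
```

Multiplying the two components reflects the intuition that a release is only dangerous when the data are re-identifiable and the environment lets an attacker reach them; other combination rules (e.g. taking the maximum) are equally possible under this model.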
Conducting depersonalization in accordance with the risk model requires a balance between the usefulness of the resulting data, measured by various quantitative metrics (indicators), and an acceptable amount of risk. The risk thresholds are set in accordance with the usage scenarios (public data, inter-organizational or private access). Within the framework of the model under consideration, depending on the purpose and objectives of depersonalization, the following quantitative metrics (indicators) are used:
- risk level: the product of the damage and the probability of re-identification;
- data utility level, or data quality assessment;
- reversibility level, which characterizes the ability to maintain the connection between the original and the depersonalized data set;
- variability of the depersonalization method;
- flexibility, which evaluates the possibility of making additions (distortions) to the depersonalized data array;
- resistance of a depersonalized set to attacks, determined by the probability of success of re-identification attacks;
- compatibility of different depersonalized sets (when comparing attributes), etc.

The algorithm for calculating the risk of choosing a depersonalization strategy is shown in Figure 1.

Figure 1. Algorithm for implementing the risk model

In this algorithm, contextual risks and data risks are calculated separately.

4. Features of building data risks

Contextual risks are an assessment of categorizable factors of organizational and technical impact on the process of storing and converting personal data. Taking into account these factors and their mutual influence, it is proposed to assess contextual risks using a scoring model based on the risk calculator of [1], with the score card built by the linear regression method.

Data risks are understood as the risks of re-identification associated with the structure and composition of the data. Access to such data may be obtained as a result of errors on the part of third parties or service personnel, or through applications (for example, a REST API). In this case, it is advisable to include the following methodological steps in the recommended model.

1. Processing attributes to highlight direct identifiers and quasi-identifiers

The risk of using depersonalized data consists in identifying a specific individual in the data set and assigning to that individual the attributes contained in the set. This situation is called re-identification. From the point of view of assessing the risk of re-identification, the most important attributes are the sensitive ones, whose compromise, disclosure or illegal use can lead to significant damage, embarrassment and/or inconvenience. Following [3], it is customary to distinguish:
- direct identifiers (used directly);
- quasi-identifiers (used in combination).
Sketches of such an attribute classification, and of the contextual score card mentioned above, are given below.
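As an illustration of step 1, the following sketch splits a record schema into direct identifiers, quasi-identifiers and other attributes. The column names and identifier lists are hypothetical; in practice this classification comes from a domain review, not a fixed dictionary.

```python
# Illustrative attribute classification for step 1. The identifier
# lists and column names are hypothetical examples, not a standard.
DIRECT_IDENTIFIERS = {"full_name", "passport_no", "phone", "email"}
QUASI_IDENTIFIERS = {"birth_date", "zip_code", "gender", "occupation"}

def classify_attributes(columns):
    """Split a record schema into direct, quasi- and other attributes."""
    direct = [c for c in columns if c in DIRECT_IDENTIFIERS]
    quasi = [c for c in columns if c in QUASI_IDENTIFIERS]
    other = [c for c in columns
             if c not in DIRECT_IDENTIFIERS and c not in QUASI_IDENTIFIERS]
    return direct, quasi, other

print(classify_attributes(["full_name", "birth_date", "zip_code", "diagnosis"]))
# (['full_name'], ['birth_date', 'zip_code'], ['diagnosis'])
```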
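Returning to the contextual-risk score card described at the beginning of Section 4: the paper does not publish its factors or weights, so the sketch below only illustrates the linear form of such a score card. The factor names, weights and intercept are invented for the example and do not come from [1].

```python
# Hypothetical linear score card for contextual risk (cf. Section 4).
# Each factor is scored on [0, 1]; the weights play the role of
# linear-regression coefficients and are purely illustrative.
CONTEXT_WEIGHTS = {
    "access_control": -0.30,  # stronger controls lower the risk
    "staff_training": -0.15,
    "data_sharing": 0.35,     # wider sharing raises the risk
    "public_release": 0.40,
}

def context_risk_score(factors, intercept=0.3):
    """Intercept plus weighted sum of factor scores, clipped to [0, 1]."""
    score = intercept + sum(CONTEXT_WEIGHTS[name] * value
                            for name, value in factors.items())
    return min(max(score, 0.0), 1.0)

print(context_risk_score({"access_control": 0.8, "staff_training": 0.9,
                          "data_sharing": 0.5, "public_release": 0.0}))  # ~0.1
```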
2. Planning possible re-identification attacks

By attacks on the confidentiality of PD we understand unauthorized actions by an attacker aimed at re-identifying the records of an individual within a depersonalized data set [2]. The data risk assessment applied to a specific set of depersonalized data depends on the chosen depersonalization methods, such as suppression or aggregation. The selection of appropriate quasi-identifiers requires taking into account various types of attacks, which can also be combined:
- a linkage attack (re-identification through linkage) is an attempt to identify an individual by linking two data sets;
- an attribution attack is carried out through the disclosure of attributes: the attributes of the group to which an individual supposedly belongs are transferred to that individual;
- a subtraction attack is aimed at reducing the original data set by means of additional knowledge;
- an inference attack collects available information in order to attack a more secure system;
- a differentiation attack identifies an individual on the basis of additional information about him that suggests his dissimilarity to the majority;
- a reconstruction attack is aimed at existing sets of aggregated data.

3. Definition of anonymization level metrics

As a result of applying depersonalization methods, data with varying degrees of anonymization are obtained. There are several metrics for measuring anonymization. Most of them are based on the concept of an equivalence class: a group of records within a data set that are identical in terms of their quasi-identifiers. Anonymity metrics are closely related to the frequency analysis of records; the probability of re-identification is generally inversely proportional to such metrics. The following metrics (each with the attacks that target it) are distinguished (a sketch of the first of them is given at the end of this section):
- k-anonymity;
- ℓ-diversity;
- t-closeness.

4. Determination of utility level metrics

For a large amount of data, it is important to have quantitative estimates of data usefulness that show the quality of the data after depersonalization methods have been applied. Utility (data quality) metrics can be quite complex, so it is recommended to use more than one of the following:
- general information loss metrics;
- classification metric;
- reuse metrics;
- entropy-based information loss metric;
- a measure of mutual utility.

5. Selecting the attacker's profile

When calculating the risk probabilities, it is important to take into account the types of attacks, which can be generalized into attacker profiles. It is assumed that the attacker has the resources and knowledge needed to carry out the attacks; the goals and the availability of additional information vary [4]. Following the established tradition, the profiles are named after three groups: "Marketer", "Prosecutor" and "Journalist" (a sketch of profile-specific risk estimates is also given below).

6. Identification of scenarios for the distribution of depersonalized data

Release-model scenarios play an important role in the depersonalization process, since different scenarios require different degrees of depersonalization; for example, the public dissemination of data requires a higher level of protection. Data distribution scenarios depend on several decisions that affect contextual risk and data risk, as shown in Figure 2 [5].

Figure 2. Data distribution scenarios
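To illustrate the equivalence-class metrics of step 3, the sketch below computes k-anonymity as the smallest equivalence-class size over the quasi-identifier projection of a data set. The records and the choice of quasi-identifiers are synthetic examples.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """k of a data set: the smallest equivalence-class size, where an
    equivalence class groups records identical on the quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values())

data = [
    {"zip": "199034", "age_band": "30-39", "diagnosis": "flu"},
    {"zip": "199034", "age_band": "30-39", "diagnosis": "asthma"},
    {"zip": "199034", "age_band": "40-49", "diagnosis": "flu"},
]
print(k_anonymity(data, ["zip", "age_band"]))  # 1: the third record is unique
```

A set is k-anonymous when every record shares its quasi-identifier values with at least k-1 others; here the third record forms a class of size one, so the set is only 1-anonymous.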
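For the attacker profiles of step 5 the paper gives no formulas, so the sketch below uses formulations common in the re-identification literature: the "Prosecutor" knows the target is in the set, so the worst-case risk is the reciprocal of the smallest equivalence-class size, while the "Marketer" tries to re-identify as many records as possible, succeeding on average once per equivalence class. These formulas are our assumption, not the paper's.

```python
from collections import Counter

def profile_risks(records, quasi_identifiers):
    """Illustrative per-profile re-identification risk estimates."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    n = sum(classes.values())
    return {
        # worst case when the target is known to be in the set
        "prosecutor": 1.0 / min(classes.values()),
        # expected fraction of records re-identified in a mass attack
        "marketer": len(classes) / n,
    }

data = [
    {"zip": "199034", "age_band": "30-39"},
    {"zip": "199034", "age_band": "30-39"},
    {"zip": "199034", "age_band": "40-49"},
]
print(profile_risks(data, ["zip", "age_band"]))
# {'prosecutor': 1.0, 'marketer': 0.666...}
```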
5. Conclusion

It should be noted that the proposed combined risk assessment model makes it possible to carry out a comprehensive analysis (at the level of both contextual risks and data risks) and then to make a balanced choice of the personal data depersonalization method, suitable for application both at an individual enterprise and on a national scale. This brings novelty and promise to the solution of the problem under consideration.

References

[1] Handbook on Security of Personal Data Processing. ENISA, 2017.
[2] The Anonymization Decision-Making Framework. UKAN, 2016.
[3] General Data Protection Regulation (Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC).
[4] Goldberger J., Tassa T. Efficient anonymizations with enhanced utility // Transactions on Data Privacy. 2010. Vol. 3, No. 2. P. 149-175.
[5] Framework of de-identification process for telecommunication service providers. ITU-T X.1148, 2020.