=Paper= {{Paper |id=Vol-2797/paper32 |storemode=property |title=Detecting and Generalizing Quasi-Identifiers by Affecting Singletons |pdfUrl=https://ceur-ws.org/Vol-2797/paper32.pdf |volume=Vol-2797 |authors=Matteo Pastore,Maria Angela Pellegrino,Vittorio Scarano |dblpUrl=https://dblp.org/rec/conf/egov/PastorePS20 }} ==Detecting and Generalizing Quasi-Identifiers by Affecting Singletons== https://ceur-ws.org/Vol-2797/paper32.pdf
Detecting and Generalizing Quasi-Identifiers by
Affecting Singletons

Matteo Pastore*, Maria Angela Pellegrino*, Vittorio Scarano*

*Dipartimento di Informatica, Università degli Studi di Salerno, Italy,
m.pastore48@studenti.unisa.it, {mapellegrino, vitsca}@unisa.it


Abstract: To adhere to the Open Government doctrine, Public Administrations (PAs) are
requested to publish Open Data while preventing the disclosure of their citizens' personal
information. Therefore, it is crucial for PAs to employ methods that ensure Privacy-preserving
data publishing by distributing useful data while protecting individual privacy.
In this paper, we study this problem with a two-phase approach. First, we detect privacy
issues by recognizing as Quasi-Identifier the minimum number of attributes that exposes the
highest number of unique values (referred to as singletons). We test our approach
on real datasets openly published by the Italian government, and we discover that the quasi-
identifier (year_of_birth, sex, ZIP_of_residence) discloses up to 2% unique values in already
anonymized datasets. Once the detection phase is accomplished, we propose an anonymization
approach to limit the privacy leakage. We investigate which combination of attributes must be
generalized to achieve the minimum number of singletons while minimising the number of
modified and removed rows. We tested our approach on the same real datasets as in the previous
phase, and we noticed that by generalizing only the rows corresponding to the singletons, we
achieve nearly no singletons while affecting only 2% of the rows.

Keywords: Privacy, Quasi-Identifiers, Anonymization, Generalization


1. Introduction
Data owners are spurred to open up their data to enable informed decision making, ensure
transparency, engage audiences, and release social and commercial value (OKF). Open data (OD)
is any information that people are free to use, re-use, and redistribute, without any legal,
technological, or social restriction (OKF). Unfortunately, data in their raw and original form could
contain personal and sensitive information about individuals, and publishing such data violates
individual privacy (GDPR, 2016). Thus, data providers should perform Privacy-preserving data
publishing (PPDP) (Chen, 2009) to provide useful data without violating individuals' privacy.
Copyright ©2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

According to the PPDP principles, data publishers have tables containing (Identifiers, Quasi-Identifiers,
Sensitive Attributes, Non-Sensitive Attributes), where Identifiers (IDs) are a set of attributes
that identifies record owners; a Quasi-Identifier (QID) is a set of attributes that could potentially
identify record owners; Sensitive Attributes are person-specific information, such as diseases,
salary, religion, or political party (GDPR, 2016); all the remaining attributes are Non-Sensitive.

   Data owners, such as public administrations (PAs), health care, and financial institutions, may
release the data they collect by de-identifying them, i.e., by masking, generalizing, or deleting IDs.
However, even anonymized public information may be re-identified by exploiting other pieces of
available data. A 2002 study found that 87% of the U.S. population can be identified using gender,
birthdate, and ZIP code as a QID by matching anonymized hospital visit records against voting lists
(Sweeney, 2002). These data are not problematic in isolation, but they enable the re-identification of
individuals when combined with additional information. The individuals whose data are re-identified risk
having their private information, such as their finances, health, or preferences, and their identity,
sold to organizations without their consent (Porter, 2008), disclosed to undesired end-users (Porter,
2008), or even used to justify the refusal of an insurance provision (Sweeney, 2002).

   The choice of the QID is an open question (Fung, 2010) since it depends on attributes that the
attacker can exploit to link actual data to any external source.

   In this article, we aim to prove that PAs' current anonymization practices are not as watertight
as first perceived. We propose an approach to identify the best QID by counting the number of
unique occurrences of a combination of values in a dataset. These unique occurrences will be
referred to as singletons. We define the best QID as the minimum number of columns/attributes that
leads to the disclosure of the highest number of singletons. Recalling that PPDP is interested in
minimising both the information loss and the privacy leakage (Fung, 2010), we interpret privacy
leakage as the number of occurring singletons, and we estimate the information loss as the number
of suppressed and modified rows, which decreases the overall dataset quality.
We focus on the following Research Questions (RQs):
- RQ1: What is the best QID according to our approach, and how many singletons does it disclose?
- RQ2: The generalization of which attribute or set of attributes among date of birth, ZIP, and sex (one of the most common QIDs in the literature) leads to the minimum number of singletons while minimising the affected rows?

   The main contributions of this article can be summarized as follows:
- We designed and implemented an algorithm that classifies as the best QID the minimum set of attributes that exposes the greatest number of singletons, where singletons are unique occurrences of a combination of values for a set of attributes. Its implementation is freely available on GitHub (https://github.com/isislab-unisa/qid_identifier_and_anonymizer);
- We observed that by generalizing only the singletons detected by the QID (sex, ZIP, date_of_birth), we obtain nearly no singletons while affecting only up to 2% of the rows. The experiments have been performed on real open datasets released by the Italian Ministry of Infrastructure and Transport (freely available on GitHub: https://github.com/isislab-unisa/driver-license-datasets).



   The rest of the article is structured as follows: in Section 2, we overview alternative approaches
proposed in the literature to detect QID and perform anonymization; in Section 3, we present our
approach to detect privacy issues by the number of occurred singletons, and we propose an
anonymization approach based on the generalization and suppression; in Section 4, we report and
discuss the performed evaluation; then, we conclude with some final considerations and future
research directions.


2. Related Work
In this section, we aim to overview the alternatives in detecting QIDs and corrective actions to
sanitise a dataset. According to the PPDP principles, we focus on microdata, i.e., data records about
individuals instead of data mining results.

2.1.   Detect Privacy Leakage Algorithms

Braghin et al. (2016) defined an approach to detect QIDs by considering both single columns and
their collections and counting the unique occurrence of values. All the QIDs are returned. Our
approach works similarly, but it elects the best QID as the smallest combination of columns enabling
the most extensive identification of singletons.

   Motwani and Xu (2007) exploit the separation and the distinct ratio to quantitatively describe the
ability of attributes to distinguish an individual from another. While the distinct ratio measures the
percentage of distinct values, the separation ratio measures the proportion of tuple pairs that can be
uniquely distinguished. We aim to find the QID by the separation ratio while keeping the distinct
ratio as a metric that does not contribute to the best QID election.
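To make these two metrics concrete, the following Python sketch computes the distinct ratio and the separation ratio of a single attribute. It is an illustration of the definitions above under our reading of Motwani and Xu (2007), not their implementation.

```python
from collections import Counter

def distinct_ratio(values):
    # fraction of values in the column that are distinct
    return len(set(values)) / len(values)

def separation_ratio(values):
    # fraction of row pairs that this attribute can tell apart
    n = len(values)
    total_pairs = n * (n - 1) // 2
    # pairs sharing the same value are NOT separated
    same = sum(c * (c - 1) // 2 for c in Counter(values).values())
    return (total_pairs - same) / total_pairs

years = [1980, 1980, 1991, 2002, 2002, 2002]
print(distinct_ratio(years))    # 3 distinct values over 6 rows -> 0.5
print(separation_ratio(years))  # 11 separated pairs over 15 -> ~0.733
```

An attribute can have a low distinct ratio and still a high separation ratio, which is why the latter is the one driving our best QID election.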

2.2.   Anonymization Approaches

The anonymization approaches explicitly remove IDs and hide the sensitive information, assuming
that the latter should not be used in data mining algorithms. However, ID removal might not be enough: it is
still possible to recognise individuals by QIDs. To prevent linking attacks, the datasets must be
sanitized (Samarati, 2001) by applying anonymization operations such as generalization, suppression,
anatomization, permutation, and perturbation (Federal Committee on Statistical Methodology, 2005).

   Among the most famous privacy models, we cite k-Anonymity (Ciriani, 2007), which is based on the
fundamental concept that if a record has a particular value for a QID, at least k-1 other records will
have the same value for that QID. Multiple versions of k-Anonymity have been proposed to
overcome some of its limitations. For instance, (X, Y)-anonymity (Wang, 2006) faces the case in which
multiple rows of the dataset are related to the same individual; MultiRelational k-Anonymity
(Nergiz, 2009) focuses on the anonymization of multiple relational tables; l-diversity
(Machanavajjhala, 2007) guarantees that each equivalence class has at least l well-represented values
for each sensitive attribute, overcoming the risk of the homogeneity attack; (α, k)-anonymity (Wong,
2006) applies local recoding to reduce data distortion; t-closeness (Li, 2007) requires that the
distance between the distribution of a sensitive attribute within an equivalence class and the distribution
of the attribute in the overall table be no more than t.



  Our approach is a modified version of k-Anonymity where:
- k is at least equal to 2;
- we discourage suppression in favour of generalization;
- while k-Anonymity operates at a global level, we can also work locally.
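Under this reading, requiring k ≥ 2 amounts to requiring that no QID value combination occurs exactly once. A minimal, illustrative check (a sketch, not the paper's released code) follows:

```python
from collections import Counter

def is_k_anonymous(rows, qid_cols, k=2):
    # every QID value combination must be shared by at least k rows
    counts = Counter(tuple(r[c] for c in qid_cols) for r in rows)
    return all(c >= k for c in counts.values())

# toy rows with hypothetical column names
rows = [
    {"year": 1980, "sex": "M", "zip": "84100"},
    {"year": 1980, "sex": "M", "zip": "84100"},
    {"year": 1991, "sex": "F", "zip": "84012"},
]
print(is_k_anonymous(rows, ("year", "sex", "zip")))  # False: the last row is a singleton
```

With k=2, the check fails exactly when the dataset exposes at least one singleton for the given QID.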


3. Our Approach for Detecting and Solving Privacy Issues
Our approach is based on the detection of privacy issues, followed by an anonymization step based
on generalization and suppression. The workflow starts with a user-provided dataset to
test, and it automatically returns the best QID. If the best QID matches (year_of_birth, municipality, sex),
it also provides the corresponding generalization.

3.1.   Detection of Privacy Issues

We interpreted the detection of privacy issues as the occurrence of unique values by considering a
single column or a combination of them. These unique values are referred to as singletons. We elect
IDs and QIDs by the number of occurred singletons. The implemented pseudo-code follows:
def detect_ID_and_QID(dataset):
    IDs = []
    stats = {}
    columns_to_check = list(dataset.columns)
    for size in range(1, len(dataset.columns)):
        # all the column subsets of the given size
        subsets = get_subsets(columns_to_check, size)
        # IDs: subsets of columns containing all distinct values
        temp_IDs, still_to_check = get_IDs(dataset, subsets)
        IDs += temp_IDs
        # for each remaining subset, store its number of singletons
        stats.update(get_stats(dataset, still_to_check))
        columns_to_check = list(set(still_to_check))
    # best QID: the smallest subset of columns exposing
    # the greatest number of singletons
    best_QID = get_best_QID(stats)
    return IDs, best_QID

   In the best QID election, we first consider the number of singletons detected by each set of
attributes. If more than one set of columns shares the same number of singletons, we elect the
smallest set of columns as the best QID.
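This election rule can be sketched as a self-contained Python fragment (helper and column names below are illustrative; the released implementation on GitHub may differ):

```python
from itertools import combinations
from collections import Counter

def count_singletons(rows, cols):
    # a singleton is a combination of values occurring exactly once
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    return sum(1 for v in counts.values() if v == 1)

def best_QID(rows, columns):
    # number of singletons exposed by every subset of columns
    stats = {}
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            stats[subset] = count_singletons(rows, subset)
    # most singletons first; among ties, the smallest subset wins
    return max(stats, key=lambda s: (stats[s], -len(s)))

rows = [
    {"year": 1980, "sex": "M", "zip": "84100"},
    {"year": 1980, "sex": "F", "zip": "84100"},
    {"year": 1980, "sex": "F", "zip": "84100"},
    {"year": 1991, "sex": "M", "zip": "84012"},
]
qid = best_QID(rows, ["year", "sex", "zip"])  # a 2-column subset already exposes both singletons
```

On these toy rows, two columns already expose the maximum of 2 singletons, so the three-column subset is not elected despite tying on the singleton count.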

3.2.   Anonymity Based on Generalization and Suppression

To ameliorate the detected privacy issues, we propose an anonymization approach based on
generalization and suppression. We suppress incomplete rows at the beginning of our technique;
afterwards, only generalization is permitted. We favor generalization over suppression since we
prefer to publish incomplete information rather than prevent the publication at all.

   We focus on the QID (date_of_birth, ZIP, sex) since it is a well-known QID. In particular, we consider
a slightly simplified version of this QID in which we only have access to the year_of_birth. This is a
simplification without loss of generality, since the approach can easily be extended to the entire date;
moreover, by only considering years, we can straightforwardly generalize dates by the mean value
of a range. In Italy, there is also a two-way correspondence between ZIP codes and
municipalities, so the two can be used interchangeably. We aim to detect which column or
combination of columns is worth generalizing to achieve the minimum number of singletons by
modifying the minimum number of rows.

   For the numerical attribute (i.e., year_of_birth), we consider the standard approach of
substituting values by intervals. We therefore sort the rows by year_of_birth and split them into
groups of at least k rows. If a cut would separate rows containing the same year, we
iteratively merge those rows into the same group until the cut falls between rows with different years.
Each year is then substituted with the interval [min_year, max_year). We also apply a second strategy
in which each interval is replaced by its average value. We hypothesize that if the actual years mostly
coincide with the mean value of the range, fewer rows will be modified while still reducing the
number of singletons.
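The interval construction described above can be sketched as follows (an illustrative Python sketch, not the exact released implementation; intervals are represented here as closed (min_year, max_year) pairs):

```python
def generalize_years(years, k=2):
    # sort the years and cut them into groups of at least k rows,
    # never splitting rows that share the same year across two groups
    s = sorted(years)
    groups, start = [], 0
    while start < len(s):
        end = min(start + k, len(s))
        # extend the cut while it would split equal years
        while end < len(s) and s[end] == s[end - 1]:
            end += 1
        # a tail smaller than k is merged into the current group
        if end < len(s) and len(s) - end < k:
            end = len(s)
        groups.append(s[start:end])
        start = end
    # each group yields an interval and, for the second strategy, a mean
    intervals = [(g[0], g[-1]) for g in groups]
    means = [sum(g) / len(g) for g in groups]
    return intervals, means

intervals, means = generalize_years([1980, 1980, 1981, 1985, 1990], k=2)
# intervals -> [(1980, 1980), (1981, 1990)]
```

Note how the second group absorbs the one-element tail so that every group keeps at least k rows.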

   For the sex column, we replace the male and female values by any gender. In this case, the
generalization plays the same role as cell suppression. For the municipality column, we exploit
the hierarchy induced by the Italian administrative levels: municipalities are generalized by
provinces, provinces by regions, and regions by states. In our experiments, we only consider the first
level of this hierarchy, generalizing municipalities by provinces. For categorical attributes (i.e.,
sex and municipality), we apply both a global and a local recoding. While global recoding
affects the entire dataset, at the local level we only modify the rows related to the singletons disclosed
by the best QID. We hypothesize that local recoding can introduce a sufficient level of privacy
protection while affecting a minimal number of rows (i.e., only slightly decreasing the overall dataset
quality). The implemented approach can be summarized as follows:
- the rows containing empty values are dropped, and the removed rows count towards the affected rows. The complete version of the dataset (i.e., only rows without any empty cell) is considered in the following steps;
- the following generalizations are performed:
  - all the municipalities are generalized by the corresponding province;
  - only the municipalities of the rows corresponding to the singletons detected by the best QID are generalized by the corresponding province;
  - all the values of the sex column are generalized by any gender;
  - only the values of the sex column corresponding to the singletons detected by the best QID are generalized by any gender;
  - all the birth years are generalized by the corresponding intervals;
  - all the birth years are generalized by the mean value of the intervals generated as before;
  - all the attributes are generalized by combining every pair of the generalizations described so far and by generalizing all fields at once;
- for each performed generalization, we compute the number of singletons, the percentage of singletons, the number of distinct values, and the number of modified and removed rows;
- we elect as best generalization the one that achieves the minimum number of singletons while affecting the minimum number of rows.
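The local variants above all follow the same pattern: only the rows behind a singleton are rewritten. A minimal sketch (the blur function below is a hypothetical generalization; the released code generalizes to provinces and any gender):

```python
from collections import Counter

def local_recoding(rows, qid_cols, generalize):
    # locate the QID value combinations occurring exactly once (singletons)
    counts = Counter(tuple(r[c] for c in qid_cols) for r in rows)
    modified = 0
    for r in rows:
        if counts[tuple(r[c] for c in qid_cols)] == 1:
            generalize(r)  # only rows behind a singleton are touched
            modified += 1
    return modified

def blur(row):
    # hypothetical generalization: hide the sex, coarsen the ZIP code
    row["sex"] = "any"
    row["zip"] = row["zip"][:2] + "xxx"

rows = [
    {"year": 1980, "sex": "M", "zip": "84100"},
    {"year": 1980, "sex": "M", "zip": "84100"},
    {"year": 1991, "sex": "F", "zip": "84012"},
]
print(local_recoding(rows, ("year", "sex", "zip"), blur))  # 1 row modified
```

The returned count feeds directly into the "modified rows" metric used to elect the best generalization.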



4. Evaluation
We tested our approaches on real datasets released by the Italian Ministry of Infrastructure and
Transport. These (anonymized) datasets contain information related to the driver licenses of all the
Italian regions. We downloaded them in October 2019 from the official site
(http://dati.mit.gov.it/catalog/dataset/patenti). In January 2020, the Ministry updated the online
version by significantly reducing the (already minimal) available content. The tested datasets (in
their original form) are available on GitHub (https://github.com/isislab-unisa/driver-license-datasets).

4.1.   Detecting Privacy Issues

We selected only the columns related to personal information (i.e., the municipality and the province
of the driver's residence, the year of birth, and the sex), while we discarded all the Non-sensitive
information, such as the driver's license details and the numerical ID. The algorithm's running time
grows linearly with the dataset size: it takes 0.0152 seconds to process a dataset with 6M rows.

Results of the QID identifier. The results of our privacy issue detection approach are available on
GitHub and are omitted here for lack of space. [Birth_year, ZIP, Sex] is reported as the best QID in all the
regions. Even if the datasets are anonymized, our approach highlights the possibility of uniquely
distinguishing up to 2% (1.93%) of the rows as singletons. While 2% may seem a negligible amount of
disclosed identities, it is worth noticing that the maximum number of disclosed singletons exceeds 25K.
This implies that removing IDs is not enough, and further anonymization actions must be performed
to publish sanitized datasets. These results answer RQ1.

4.2.   Anonymization Approach Based on Generalization

We considered three datasets used in the previous analysis, heterogeneous in the disclosed
percentage of singletons. Results are provided in Table 1.

Year range VS Year mean value. By comparing the year generalization by intervals (row Y_ran of
Table 1) and by mean values (row Y_avg of Table 1), we observed that the mean value yields 95%
fewer singletons while affecting 7% fewer rows than the range approach.

Global VS Local recoding. By comparing the global generalization of the municipality (row M of
Table 1) and its local recoding (row M_loc of Table 1), we observed that the local recoding achieves
a number of singletons close to 0, while the global recoding completely avoids the disclosure of
singletons. On the other hand, the global recoding affects the entire dataset, significantly decreasing
the dataset quality, while the local recoding affects only 2% of the rows. We consider the
generalization of the municipality (and the sex) only for the singletons disclosed by the QID
(year_of_birth, sex, ZIP) a good trade-off between privacy preservation and data quality (RQ2).

Comparison with k-Anonymity. We used the Valle d'Aosta dataset to compare our approach with k-
Anonymity (https://dzone.com/articles/an-easy-way-to-privacy-protect-a-dataset-using-pyt). We ran k-
Anonymity by generalizing municipalities by their provinces, sex by any gender, and the year by
intervals of width 4. We considered the generalization of any set of attributes. We allowed a
suppression rate equal to 0.01 in all cases, except for the sex generalization, where it is set to 0.05.
At the global level, both our approach and k-Anonymity modify almost the entire dataset. However,
while we obtain a generalized version of the dataset information, k-Anonymity removes many rows.
When a small number of singletons occurs, k-Anonymity drops the corresponding rows; in all the
other cases, it drops at least 200 rows and modifies all the remaining ones. Thanks to the local
recoding performed by our approach, we can obtain a minimum number of singletons (close to 0)
while affecting a small portion of the dataset (up to 2%). In conclusion, we achieve the same results
in privacy preservation, while our results lead to better data quality thanks to the local recoding.


Table 1: Results of the anonymization approach. #S is the number of singletons, %S its percentage,
Size is the full dataset size, MR the number of modified rows, and DV stands for distinct values. The
first row of the table reports the number of singletons disclosed by the best QID (year_of_birth, sex,
ZIP). Then, we report the effect produced by our generalization algorithm when affecting the columns
reported in the first column of this table. The rows whose label carries the suffix loc are related to a
local recoding approach, while the others correspond to a global one. S refers to the sex column, M to the
municipality column, and Y to the year_of_birth column. While Y_ran represents the generalization of
years by range, Y_avg exploits the range mean value.

               Valle d'Aosta                    Molise                          Umbria

               #S      %S      Size     DV      #S    %S     Size      DV       #S      %S     Size      DV

               1,684   1.93    87,464   9,174   869   1.30   597,243   13,391   2,569   0.15   198,312   16,628

 Cols          #S      %S      MR       DV      #S    %S     MR        DV       #S      %S     MR        DV

 S_loc         1,264   1.45    1,684    8,964   745   0.12   869       13,329   2,219   1.07   2,569     16,408

 M_loc         4       ~0      1,679    7,501   7     ~0     860       12,539   7       ~0     2,556     14,078

 S, M_loc      1       ~0      1,684    7,785   32    ~0     860       12,680   3       ~0     2,569     14,446

 S             621     0.71    87,464   5,166   295   0.05   597,243   7,101    909     0.46   198,312   9,490

 M             4       ~0      87,464   167     127   ~0     414,584   338      7       ~0     198,312   324

 Y_ran         198     0.002   87,464   2,739   1     ~0     556,875   3,641    325     ~0     198,312   4,806

 Y_avg         9       0.001   81,476   607     1     ~0     597,243   736      3       ~0     198,312   1,089

 S, M          1       ~0      87,464   85      48    ~0     597,243   171      7       ~0     198,312   324

 S, Y_ran      70      ~0      87,464   1,442   0     0      597,243   1,895    116     ~0     198,312   2,606

 S, Y_avg      5       ~0      87,464   308     0     0      597,243   368      1       ~0     198,312   545

 M, Y_ran      0       0       87,464   43      0     0      597,243   44       0       0      198,312   4

 M, Y_avg      0       0       85,948   8,817   0     0      584,794   8        0       0      194,932   16

 S,M,Y_ran     0       0       87,464   22      0     0      597,243   44       0       0      198,312   43

 S,M,Y_avg     0       0       87,464   4       0     0      597,243   8        0       0      198,312   8



5. Conclusions and Future Work
In this article, we propose to detect QIDs by counting singletons in a dataset. We observed that the
best QID (date_of_birth, sex, ZIP) discloses up to 2% (and up to 25K) singletons in already
anonymized datasets. When a privacy leakage is reported, PAs usually react by closing data or
publishing poorly informative datasets. As an example, the datasets we analyzed were substituted with
a version of significantly lower informativeness, as only the province of residence, driving license
category, and release date are provided. These datasets, in our opinion, are reduced to pointless Open
Data. Instead of making data useless, we suggest investing in further sanitation actions. In this
article, we observed that, thanks to a local recoding, we achieve the minimum number of modified rows
(up to 2% of affected rows) while obtaining a number of singletons close to 0. We aim to
provide our proposal as a framework to support PAs in publishing datasets significantly more
informative than those currently available on their websites while preserving citizens' privacy.

References

Braghin, S., Gkoulalas-Divanis, A., Wurst, M. (2016) Detecting quasi-identifiers in datasets (US Patent 15 193
    536)

Chen, B.C., Kifer, D., LeFevre, K., Machanavajjhala, A. (2009) Privacy-preserving data publishing. Foundations
    and Trends in Databases 2 (1-2), pp. 1-167

Ciriani, V., di Vimercati, S.D.C., Foresti, S., Samarati, P. (2007) k-anonymity. In: Secure Data Management in
     Decentralized Systems, pp. 323-353

European Regulation 2016/679 of the European Parliament and of the Council (2016) GDPR, https://eur-
    lex.europa.eu/eli/reg/2016/679/oj, last access March, 2020

Federal Committee on Statistical Methodology (2005) Statistical policy working paper 22. Report on Statistical
    Disclosure Limitation Methodology

Fung, B.C.M., Wang, K., Chen, R., Yu, P.S. (2010) Privacy-preserving data publishing: A survey of recent
    developments. ACM Computing Surveys 42 (4), pp. 14:1-14:53

Li, N., Li, T., Venkatasubramanian, S. (2007) t-closeness: Privacy beyond k-anonymity and l-diversity. In: IEEE
     23rd Int. Conf. on Data Engineering, pp. 106-115

Machanavajjhala, A., Kifer, D., Gehrke, J., Venkitasubramaniam, M. (2007) l-diversity: Privacy beyond k-
    anonymity. ACM Transactions on Knowledge Discovery from Data 1 (1), Article 3

Motwani, R., Xu, Y. (2007) Efficient algorithms for masking and finding quasi-identifiers. In: Proceedings of the
    Conference on Very Large Data Bases, pp. 83-93

Nergiz, M.E., Clifton, C., Nergiz, A.E. (2009) Multirelational k-anonymity. IEEE Transactions on Knowledge
    and Data Engineering 21 (8), pp. 1104-1117

OKF: Open data, https://okfn.org/opendata/, [last access May, 2020]

Porter, C.C. (2008) De-identified data and third party data mining: the risk of re-identification of personal
     information. Journal of Law, Commerce & Technology (5), p. 1



Samarati, P. (2001) Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and
    Data Engineering 13 (6), pp. 1010-1027

Sweeney, L. (2002) Achieving k-anonymity privacy protection using generalization and suppression. International
    Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (5), pp. 571-588

Wang, K., Fung, B.C.M. (2006) Anonymizing sequential releases. In: 12th ACM SIGKDD Int. Conf. on
   Knowledge Discovery and Data Mining, pp. 414-423

Wong, R.C.W., Li, J., Fu, A.W.C., Wang, K. (2006) (α, k)-anonymity: An enhanced k-anonymity model for privacy
   preserving data publishing. In: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pp. 754-759


About the Authors

Matteo Pastore
Matteo Pastore was born in Salerno in 1998. He will graduate from the University of Salerno in 2020 and will
continue his studies with a Master's degree in Computer Science. He is interested in Cloud Computing,
Cybersecurity, Open Data, and Networks.

Maria Angela Pellegrino
Maria Angela Pellegrino, born in 1994, obtained her Master's degree in Computer Science in 2018 at the
University of Salerno. Since 2018, she has been a Ph.D. student in Computer Science at the University of Salerno
under the supervision of Professor Vittorio Scarano. Her studies focus on how to improve the quality of
(Linked) Open Data while preserving the privacy of individuals. Her research interests include data quality,
privacy, Open Data, and the Semantic Web.

Vittorio Scarano
Vittorio Scarano is a Full Professor of Computer Science at the University of Salerno (Italy).
Since 1996, he has co-directed the ISISLab laboratory within the Department, which he founded with Alberto Negro.
ISISLab has hosted, until now, the research activity of 20 PhD students and more than 20 collaborators
(grants, fellowships), and has provided support for more than 120 theses (Bachelor and Master level) with
computing facilities such as a 38-node IBM cluster, file and application servers, a teaching lab, a
stereoscopic screen, augmented reality devices, etc. He is co-author of more than 140 papers in
internationally refereed journals and conferences of IEEE, ACM, etc., and he has been the PhD supervisor of
several PhD students of the Computer Science PhD program at the University of Salerno. In 2000, he was
awarded (with his co-authors) the "Best poster award" at the 9th World Wide Web Conference (WWW9).
In 2008, he was awarded by IBM the International IBM Jazz Innovation Award as a grant of 25,000$.
He coordinated the European funded H2020 research project ROUTE-TO-PA "Raising Open and User-friendly
Transparency-Enabling Technologies for Public Administrations" (grant agreement No 645860) with 12
partners, as well as national and regional funded research and innovation projects. His research interests
include Distributed Systems, Collaborative Systems and Open Data, and Enhanced (Virtual/Augmented) Reality.