Methods and Techniques for Data Quality Improvement of (Linked) (Open) Data

Maria Angela Pellegrino
Dipartimento di Informatica, Università degli Studi di Salerno, Italy
mapellegrino@unisa.it

Abstract. Good decisions need good data: only by exploiting good data is it possible to make effective decisions. The goodness of data is usually related to the task they will be used for. However, it is possible to identify some task-independent quality dimensions which are merely related to the data themselves. In order to improve the intrinsic data quality, we propose a proactive approach. Our goal is to offer data providers (and consumers) a set of methods and techniques to guide them in assessing and improving the quality of the data they are interested in. We mainly focus on Linked (Open) Data. Since the published data might also contain personal data, there is the need to make the data set compliant with the General Data Protection Regulation (GDPR). Therefore, besides quality problems, we are also interested in discovering any privacy breach and, if needed, in proposing corrective actions. The final goal is to give data providers the possibility of publishing better data. The proposed approach is pragmatic: we will not only design but also implement it. We plan to wrap it into a social platform, already used by several public administrations, which enables us to test the applicability of the proposed methods in real settings.

Keywords: Data quality, Privacy breaches, Quality assessment, Quality improvement, Privacy awareness, Data publication

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Problem statement

Data Quality (DQ) can be defined as "the level of compliance of the data with the purpose they will be used for" [1]. Thus, data quality is defined in terms of fitness for use. The exploitation of data not ready for use may lead to incorrect conclusions and poor decisions; only by using high-quality data is it possible to achieve effective decision making. Thus, data providers should make an effort to improve the quality of data sets under definition to simplify their exploitation. Moreover, data providers might also deal with data sets containing personal data. The General Data Protection Regulation (GDPR) [22] defines which data are considered personal and in which cases individual privacy can be compromised in order to increase the utility for the community. Therefore, data providers have to make data sets compliant with the GDPR before publication.

The problem we want to face is how to design a unified approach which allows the publication of high-quality data while preserving individual privacy. Since we are interested in both Quality and Privacy, we call our approach Qualicy-aware. The general workflow to assess and improve quality/privacy aspects should be: 1) choose the quality dimensions of interest, 2a) assess the quality and 2b) detect privacy problems, 3a) improve the overall data set quality and 3b) prevent privacy breaches. With a reactive approach, the quality is improved after the data set publication, for instance when it has to be used in a practical use case. The reactive philosophy can be summarised by "publish first, refine later". The alternative is to adopt a proactive approach by improving the data set quality as early as possible. Ideally, it might be improved during the publishing phase.
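To make the workflow enumerated above concrete, the following is a minimal Python sketch of a proactive, qualicy-aware publication loop. Every name in it (publish_proactively, detect_privacy_breaches, the toy completeness metric) is a hypothetical placeholder, not part of an existing implementation.

```python
# Minimal, hypothetical sketch of the proactive workflow above;
# all names are placeholders, not an existing implementation.

def completeness(column):
    """Share of non-missing values: one task-independent quality dimension."""
    return sum(v is not None for v in column) / len(column)

def detect_privacy_breaches(dataset):
    """Step 2b (toy rule): flag columns whose name suggests personal data."""
    return [name for name in dataset if name in {"ssn", "iban"}]

def publish_proactively(dataset, threshold=0.9):
    """Steps 2a/2b and 3a/3b: assess and fix *before* publication."""
    for name, column in dataset.items():                  # 2a) assess quality
        if completeness(column) < threshold:
            print(f"column {name!r} needs improvement")   # 3a) improve here
    for name in detect_privacy_breaches(dataset):         # 2b) detect breaches
        del dataset[name]                                 # 3b) naive prevention
    return dataset

print(publish_proactively({"name": ["Ada", None], "ssn": ["123-45-6789"]}))
```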
Our proposal is to provide data publishers with a set of (semi-)automatic techniques to identify quality problems and improve the overall quality of a data set before its publication. Moreover, we aim to take privacy concerns into account and prevent personal information leakage. Our pragmatic approach will be integrated into SPOD (Social Platform for Open Data), which can be used by citizens, Public Administrations (PAs), associations, and any kind of stakeholder in order to produce and consume Open Data (OD), also in Linked format. Our approach must not require technical skills, to be compliant with the SPOD audience. Since data can be both in tabular and in linked format, we plan to work with data in general and try to define strategies independent of the data format. Only when necessary do we intend to exploit the peculiarities of a specific data format. In conclusion, we can summarise our goal as the definition of strategies to assess and improve data quality and to manage privacy aspects of (Linked) (Open) Data. The parentheses delimit the parts which can be omitted. In other words, we plan to work i) with Data in general, ii) with Open Data in tabular format (3-star data according to Tim Berners-Lee's rating system [3]), iii) with Linked Data, and iv) with Linked Data also released under an open license as 5-star data [3], i.e. Linked Open Data.

1.1 Data Quality

Several quality dimensions and taxonomies have been defined to evaluate data quality. Ballou and Pazer [2] identify accuracy, completeness, consistency, and timeliness as the main quality dimensions. Wand and Wang [29] classify quality dimensions into intrinsic, accessibility, contextual, and representational DQ: data should be i) intrinsically of a good qualitative level and ii) accessible; iii) they should be compliant with the context they will be used in, and iv) the format itself should also be qualitatively good. Besides these general definitions, data quality dimensions can be specialised for Linked (Open) Data (LOD). According to Zaveri et al. [30], accuracy and completeness belong to the intrinsic data quality. They further distinguish syntactic and semantic accuracy.

Syntactic accuracy. A value is syntactically accurate when it is valid, i.e. it belongs to the set of acceptable values according to the domain of interest [12]. Therefore, syntactic accuracy (also called syntactic validity) is the degree of conformity to the syntactic rules determined by the modelled domain. The metrics identified for syntactic validity are:
– detecting the explicit definition of the allowed values for a certain data type,
– detecting the compliance of values with syntactic rules (e.g. patterns),
– detecting the presence of outliers,
– detecting typos in literals.

Semantic accuracy. According to Zaveri et al. [30], semantic accuracy is defined as the degree to which data values correctly represent real-world facts. For instance, suppose that the flight between Paris and New York is A123, while in a data set the same flight is represented as A231. In this case, the instance is semantically inaccurate since the flight ID does not represent its real-world state [30]. The metrics identified for semantic accuracy are:
– detection of outliers by using distance-based methods,
– detection of inaccurate values by comparing the values of different properties,
– detection of inaccurate classifications and labelling.
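Before moving to completeness, here is a minimal sketch of the second syntactic-accuracy metric above, i.e. compliance of values with syntactic rules expressed as patterns. The ISO date pattern and the sample column are invented for the example:

```python
import re

# Hypothetical syntactic rule: dates must match the ISO yyyy-mm-dd pattern.
ISO_DATE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

def syntactic_accuracy(values, pattern):
    """Degree of conformity to a syntactic rule: share of valid values."""
    valid = sum(1 for v in values if pattern.fullmatch(v))
    return valid / len(values)

column = ["1985-02-15", "1985-15-02", "n/a", "2001-07-30"]
print(syntactic_accuracy(column, ISO_DATE))  # 0.5: two values violate the rule
```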
Completeness. Fürber and Hepp [12] classified completeness into i) schema completeness, ii) column completeness, iii) population completeness, and iv) interlinking completeness. Schema completeness is the degree of completeness of the ontology, i.e. no relevant class or property is missing from the ontology. Column completeness can be defined in terms of the number of missing values for a specific property/column. Population completeness is the percentage of the real-world objects of a particular type that are covered by the data set. Interlinking completeness (specific to LOD) refers to the degree to which the instances contained in the data set are interlinked.

2 Relevancy

By providing methods to publish high-quality data, the effort and the time needed to make data ready for use will be reduced. Since we want to improve the quality of (Linked) (Open) Data, the problem is relevant for all data publishers, contributors, and consumers. Moreover, the proposed approach will be wrapped into SPOD, which is already adopted by several users, such as our national Public Administration and cultural associations. Therefore, on the one hand, they can benefit from our results; on the other hand, they can also be involved in the evaluation phase of our approach in order to assess its applicability in real settings.

3 Related work

Linked (Open) Data quality assessment. SWIQA [12] is a quality assessment framework which relies solely on Semantic Web technologies, without any external source. Like our proposed approach, SWIQA may be used both by data consumers to find high-quality data sources and by data owners to evaluate the quality of their own data. Its authors selected quality dimensions which rely only on the data source, regardless of the specific task the data will be used for. Thus, they aim to provide an objective, i.e. task-independent, quality assessment. If we limit ourselves to the intrinsic DQ, they cover syntactic and semantic accuracy, completeness, timeliness, and uniqueness. In general, they consider a wider range of metrics. They evaluate the metrics under the Closed World Assumption (CWA), i.e. everything that is not known is assumed to be false. This hypothesis is due to the metric definitions. However, the Semantic Web typically assumes an open world, i.e. everything we do not know is simply not defined yet. The CWA might not always be applicable, since LOD suffer from incompleteness. Sieve [19] is based on the opposite assumption: it considers data quality strictly dependent on the task. Therefore, the user can customise the settings by specifying metrics, scoring functions, and aggregation functions in an XML file. It evaluates both the semantic accuracy and the completeness of the queried LOD. Regarding how to assess data quality and display the results, Langer et al. [15] describe a clear workflow to evaluate a set of metrics and report the results. Looking at the provided results, users can also change their quality desiderata. This implies a cyclic process in which quality metrics are defined, evaluated, and refined. This theoretical workflow is implemented in SemQuire [15], which is focused on quality assessment. SHACL (Shapes Constraint Language, https://www.w3.org/TR/shacl/) is a W3C standard to validate LOD against a set of conditions. It is useful for different purposes, e.g. data integration. Other interesting works can be found in the survey by Zaveri et al. [30]. The cited works focus only on data quality assessment without considering the improvement step. Moreover, they do not provide a privacy-aware process.
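As a concrete illustration of the SHACL-based validation mentioned above, the following sketch uses the pyshacl library; the shape and data are invented for the example, and the exact API should be checked against the pyshacl documentation rather than taken as authoritative.

```python
from rdflib import Graph
from pyshacl import validate  # assumed API: validate(data, shacl_graph=...)

# Toy data: one person with a malformed birth date (a syntactic accuracy issue).
data = Graph().parse(data="""
@prefix ex: <http://example.org/> .
ex:alice ex:birthDate "1985-15-02" .
""", format="turtle")

# Shape: ex:birthDate values must match the ISO yyyy-mm-dd pattern.
shapes = Graph().parse(data="""
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .
ex:PersonShape a sh:NodeShape ;
    sh:targetSubjectsOf ex:birthDate ;
    sh:property [ sh:path ex:birthDate ;
                  sh:pattern "^\\\\d{4}-(0[1-9]|1[0-2])-\\\\d{2}$" ] .
""", format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes)
print(conforms)      # False: "1985-15-02" violates the pattern
print(report_text)
```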
Linked (Open) Data quality improvement. In his survey, Hadhiatma [13] underlines the need for a framework which helps in improving LOD quality. He surveys several approaches which exploit inductive learning methods in order to enrich and complete LOD. Among them, Paulheim [23] defined an algorithm able to detect co-occurrences and patterns in DBpedia types. Sleeman and Finin [26] worked on a labelled training set to predict the type of instances. In general, machine learning, statistical methods, and external knowledge are the main methods employed to detect patterns and find missing information [13]. DaCura [8] is a framework developed to help data set curators. Since its users may not have technical skills, it shares our target audience. The framework consists of a collection of tools able to detect and curate quality problems in linked data sets. Therefore, it is used both to assess and to improve data set quality. DaCura and our proposed approach share the idea that quality should already be addressed at the definition stage, and that the process has to be cyclic. If we consider only the intrinsic DQ, Feeney et al. [8] address both the accuracy and the completeness quality metrics. In general, they consider a wider set of metrics, including several metrics which we are not considering at the moment. On the other side, our goal is to consider both quality and privacy awareness, which is completely absent in DaCura.

Privacy awareness in LOD. A typical content-based data leakage prevention system (DLPS) works by monitoring sensitive data, mainly by using regular expressions, data fingerprinting, and statistical analysis. Regular expressions are normally used under a certain rule, such as detecting social security numbers and credit card numbers. Dataguise's DgSECURE [6] supports enterprise administrators in secure data analytics, application testing and development, and the general protection of sensitive data across enterprise cloud repositories. It enables data owners to discover, count, and report on sensitive data assets through a sophisticated regular expression (regex) pattern builder; it handles structured, semi-structured, or unstructured content, and it finds sensitive data such as credit card numbers, SSNs, names, and email addresses. As for anonymization, it is a well-consolidated approach in relational data [16,18,28]. However, its counterpart on LOD is still under development [17,31]. The main concern is that both de-anonymization techniques and LOD base their strength on interlinking. Nevertheless, researchers working on heterogeneous graph de-anonymization are trying to reuse and adapt approaches already applied to homogeneous graphs, e.g. social networks. These approaches are mainly based on clustering and graph modification [33]. One of the considered approaches is k-RDF-Neighbourhood Anonymity [14], which adapts the k-Neighbourhood [32] algorithm to LOD released as RDF graphs. It is rare to find a technique able to manage both the graph structure and the attributes attached to each node; k-Neighbourhood is able to manage both structural and attribute aspects.
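The regex-based monitoring style described above can be sketched in a few lines; the patterns below are deliberately simplified illustrations, not production-grade DLPS rules:

```python
import re

# Simplified, illustrative patterns; real DLPS rules are far stricter.
SENSITIVE_PATTERNS = {
    "US SSN":      re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email":       re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
}

def scan_for_sensitive_data(text):
    """Return the kinds of sensitive data found in a free-text value."""
    return [kind for kind, pat in SENSITIVE_PATTERNS.items() if pat.search(text)]

print(scan_for_sensitive_data("Contact jane@example.org, SSN 123-45-6789"))
# ['US SSN', 'email']
```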
4 Research questions

Our research questions (RQs) deal with both the assessment and the enhancement step, considering both quality dimensions and the avoidance of privacy breaches. The RQs related to the assessment step can be summarised as follows:

RQ1. To what extent can data quality and privacy concerns be assessed independently of the data format? In which cases, if any, is there the need to consider the original data format? In order to define the quality dimensions only once and reuse them both for OD and LOD, we want to investigate whether the quality dimensions, focusing in particular on accuracy and completeness, can be defined independently of the data format without losing precision. The same consideration holds for privacy aspects.

RQ2. Can (automatic) data type inference be useful (in terms of effectiveness and efficiency) in (linked) (open) data quality assessment?

RQ3. Can (automatic) data type inference be useful (in terms of effectiveness and efficiency) in discovering privacy breaches?

The main RQ related to the improvement phase can be summarised as follows:

RQ4. How can data quality be improved while preventing privacy breaches?

The research questions are presented in the order in which they will be addressed during my Ph.D., which also justifies their different degrees of refinement. During this year (my first year of Ph.D.), I will mainly focus on quality and privacy assessment, while in the following years I will focus first on how to improve data quality and then on how to manage privacy leakages. Therefore, the fourth question will be further refined in the future.

5 Hypotheses

At this stage of the work, we can hypothesise results only about the quality and privacy assessment. H1 is related to RQ1, while H2 is related to RQ2 and RQ3.

H1. We hypothesise that it is possible to define approaches which work directly on values (or collections of them) without caring about the original data set format.

H2. We consider our automatic data type inference (detailed in Section 6) a suitable method to address both quality problems and privacy concerns. We defined and implemented an approach to automatically infer the type by working exclusively on values. The inferred data types are consequently used i) to give an insight into quality aspects and ii) to detect whether privacy breaches occurred. The performance and the scalability of this approach have been tested on open data sets organised in tabular format. In the near future, we aim to verify whether we obtain the same (positive) results also on LOD. In particular, we plan to verify whether it returns correct results and whether it is the most efficient way to proceed. If so, it is a first step towards defining promising techniques that are independent of the original data format. Consequently, we should verify whether the same consideration can be extended to other assessment and enhancement approaches.

6 Preliminary results

We designed and implemented an approach [9] to assess the quality level and the occurrence of privacy leakage by working on the actual content of each value. Our approach infers not only basic data types (such as string, number, date) but also meta data types inspired by the GDPR. The novelty of our approach does not lie in the type inference step, but in the exploitation of the inferred types both to assess quality aspects and to detect privacy breaches.
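To give an idea, here is a minimal sketch of how such content-based type inference might look; the rules and meta data types are simplified placeholders, while the actual implementation [9] recognises a much richer set:

```python
import re
from datetime import datetime

# Simplified meta data type rules, loosely inspired by the GDPR categories
# named in the text; the real approach [9] recognises many more.
META_TYPES = {
    "email":    re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "ZIP code": re.compile(r"^\d{5}$"),
    "SSN":      re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def infer_types(value):
    """Return (basic_type, meta_type) for a single cell value."""
    meta = next((name for name, pat in META_TYPES.items()
                 if pat.match(value)), None)
    try:
        float(value)
        return "number", meta
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date", meta
    except ValueError:
        pass
    return "string", meta   # default: string with no refined meta type

for v in ["84084", "1985-02-15", "jane@example.org", "hello"]:
    print(v, "->", infer_types(v))
# 84084 -> ('number', 'ZIP code'); jane@example.org -> ('string', 'email'); ...
```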
Regarding privacy breaches, we check whether i) a content privacy breach occurred: we verify whether a description (i.e. any field classified as string without a refined meta data type) contains any structured sensitive information, such as a phone number, IBAN, or SSN; and whether ii) a structural privacy breach occurred: by considering the meta data types attached to the columns, we verify whether the data provider badly designed the data set by forcing users to fill in cells with personal information. More in detail, our approach works as follows:
– it takes as input a data set seen as a collection of columns;
  * for each column, seen as a collection of values:
    1. for each value, the data type is inferred according to its content. Our approach attaches to each value a basic data type, such as number, string, and date, and (if possible) a meta data type, such as province, municipality, ZIP code, SSN, IBAN, email, address, surname, name, and so on. The latter is used to capture the semantics of the value. By default, each value is a string as basic data type and has no meta data type;
    2. for string values (i.e. the values for which the type inference approach fails to refine the meta data type), the proposed approach checks whether a typo occurs. In other words, it verifies whether, by replacing, adding, removing or swapping letters, a known meta data type is matched;
    3. if the typo check also fails, the proposed approach verifies whether the value contains structured personal data, e.g. whether it contains an IBAN, an email, or an address. In that case, a content privacy breach occurred;
  * to each column we assign the most frequent data types among its values;
  * for each column, the completeness and the accuracy are computed;
– once a data type has been attached to each column, the type inference module checks whether a structural privacy breach occurred. By structural privacy breach we mean both the presence of information which individually exposes personal details, e.g. an SSN, and the co-occurrence of quasi-identifiers, i.e. bits of information which together identify an individual unequivocally, e.g. the co-occurrence of date of birth, gender, and ZIP code.

Both the correctness and the scalability of this approach have been evaluated on open data sets [9]. The approach is completely independent of the data format: starting from a tabular or a linked data set, it is possible to work on each value and to assess quality and privacy aspects with our approach. We have already provided SPOD with a prototype of this approach. Moreover, SPOD has also been enhanced with a component [7] to query LOD via SPARQL and organise the results into a tabular format. The results of a SELECT query can always be organised in tabular format; therefore, we plan to verify the applicability of our approach also to LOD by organising the queried data in a tabular view.

7 Approach

Our pragmatic approach aims to help data providers both 1a) in assessing data quality problems and 1b) in identifying privacy leakages, and 2) in providing effective and efficient strategies to solve the detected problems. It is interesting to notice that by improving one quality dimension, the other ones could be compromised. For instance, by making data compliant with the GDPR, completeness could be compromised: to anonymise ZIP codes we might omit the last two digits. In this way, the completeness (and also the accuracy) is affected. This symbiosis of causes and solutions of quality and privacy aspects should be taken into account when defining a data quality assurance process.
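The ZIP-code example above can be made concrete with a short sketch; both the generalisation rule and the toy completeness metric are illustrative assumptions:

```python
def anonymise_zip(zip_code, kept_digits=3):
    """GDPR-style generalisation: keep only the first digits of a ZIP code."""
    return zip_code[:kept_digits] + "*" * (len(zip_code) - kept_digits)

def digit_completeness(column):
    """Toy metric: share of fully specified (non-masked) characters."""
    total = sum(len(z) for z in column)
    masked = sum(z.count("*") for z in column)
    return (total - masked) / total

zips = ["84084", "84131", "80100"]
masked = [anonymise_zip(z) for z in zips]   # ['840**', '841**', '801**']
print(digit_completeness(zips), "->", digit_completeness(masked))  # 1.0 -> 0.6
```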
The entire process can be implemented as a cyclic approach, summarised as follows:
1. data quality assessment
   (a) definition of the quality dimensions to assess
   (b) measurement of the chosen quality dimensions
   (c) representation of the results of the measurements
2. data quality improvement
3. check whether the improvements negatively affected the other quality dimensions (also called validation)
In the validation step, all the measurements for all the considered quality dimensions must be repeated. Thus, the validation step matches the measurement step. The approach is graphically represented in Figure 1.

Fig. 1. Schema of the proposed approach.

Quality assessment. We decided to focus on accuracy and completeness.

Syntactic accuracy. We plan to:
– apply our type inference approach to LOD and then compare the inferred data types with the data types specified in the queried LOD. For instance, suppose we test all the values of the dbo:birthDate property: we can verify whether they are correctly recognised as dates;
– enhance the type inference approach to recognise patterns attached to syntactic rules reported in the queried LOD. For example, if a relation has a data type pattern not yet supported by our type inference approach (e.g. date time), we can add the regex to recognise it and identify any syntactically wrong values;
– exploit either clustering algorithms or statistical approaches to detect outliers. At the moment, we are following the same approach described by Fleischhacker et al. [10] and comparing DBSCAN, IQR, and Z-score in order to verify which is the most accurate and efficient technique;
– compare the current typo detection approach (part of the type inference process) with clustering algorithms. The main drawback of the first approach is its scalability: for each string for which a typo is hypothesised, it computes all the words obtained by adding, removing, swapping or replacing a letter in the original word. Obviously, this naïve approach explodes if we consider more than one error. Our hypothesis is that a clustering algorithm achieves better results while also gaining in efficiency. The main difficulty is in finding clustering algorithms able to deal with strings. At the moment, we are comparing k-means and agglomerative clustering to identify the most accurate and efficient approach. An alternative is to exploit word or graph embedding techniques to convert words (or the whole graph) into vectors and feed them into clustering algorithms. Right now we are evaluating the performance of KGloVe [5] and RDF2Vec [24] on the clustering task.

Semantic accuracy. We plan to check whether a set of values is semantically accurate by forecasting the values from other properties and then comparing the predicted values against the actual ones. We aim to exploit either link prediction or external resources. External resources are strictly dependent on the tested source. For example, in order to validate data in DBpedia, we can use other well-known Knowledge Bases, such as Wikidata or Freebase.

Completeness. We plan to compare the column completeness calculated by the type inference module against the one calculated directly on the graph. We will further consider how to evaluate the other completeness dimensions.
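As an illustration of the statistical outlier detection compared under syntactic accuracy (IQR vs. Z-score), a minimal sketch with the conventional default thresholds; the sample column is invented:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# e.g. heights in centimetres with one badly encoded value
heights = [172, 168, 181, 175, 169, 1771]   # 1771 is a plausible typo
print(iqr_outliers(heights))                # [1771]
print(zscore_outliers(heights))             # []: the outlier inflates sigma
```

The last line also hints at why the comparison matters: on small samples a single extreme value inflates the standard deviation enough to mask itself from the Z-score rule, while the IQR rule still catches it.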
Detection of privacy breaches. The GDPR categorises as personal data "any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, ...". While some bits of information may not uniquely identify individuals on their own, they can potentially identify individuals when combined with other attributes [21,27]. The combination of such attributes is defined as a quasi-identifier [22]. Both the occurrence of personally identifiable information (PII) and of quasi-identifiers is detected by our data type inference approach. It can be further enhanced to recognise a wider range of structural privacy breaches. We also deal with content privacy breaches. As an alternative, we could exploit sentiment analysis techniques or classification algorithms to distinguish sensitive and non-sensitive information.

Data quality improvement. To improve the data quality, we plan to perform data enrichment. It can be realised by exploiting clustering algorithms in order to identify new classes and groupings. To address the completeness requirement and enrich the data set, we plan to apply link prediction techniques. Through data cleansing approaches, we aim to recognise erroneous data and clean them. Applying Machine Learning (ML) approaches to LOD raises several difficulties: LOD lack negative examples [25], and performing the feature extraction phase on graphs is particularly expensive. Besides applying ML algorithms directly to LOD, entities (and relations) can be vectorised by graph embedding techniques [4,5,24]. The obtained vectors will then be fed into ML algorithms.

Privacy leakage avoidance. Privacy-preserving data publishing [11] provides methods and tools to publish data sets by reaching a good trade-off between privacy preservation and the overall utility of the published data set. Anonymization techniques, also suggested by the GDPR, hide personal data based on the idea that they should not be involved in statistical analyses. A naïve solution is the removal of PII, e.g. SSN, name, and surname. However, because of the power of modern re-identification algorithms [20], removing PII does not guarantee that the remaining data do not identify individuals. In order to make data sets compliant with the GDPR, we want to investigate k-RDF-Neighbourhood Anonymity [14].

Non-functional requirements. Besides the functional requirements, the proposed approach has to address the following non-functional requirements (a sketch of how the first two could be combined follows the list):
– reversibility of the actions, in order to give data providers the possibility to undo every action;
– traceability of the actions. This requirement is inspired by the potential presence of several actors involved in the definition and maintenance of the data sets under definition; therefore, there is the need to keep track of the performed actions and their owners;
– efficiency;
– scalability, since the data could increase dramatically;
– interactivity, since data publishing and quality improvement could involve several actors who have to work collaboratively.
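As announced above, a minimal sketch of how reversibility and traceability could be realised together through an action log; the design is hypothetical, not part of the current SPOD prototype:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    owner: str                 # traceability: who performed the action
    description: str
    apply: Callable[[], None]
    undo: Callable[[], None]   # reversibility: every action can be undone

class ActionLog:
    """Keeps the full history of applied actions and supports undo."""
    def __init__(self):
        self.history: list[Action] = []

    def perform(self, action: Action):
        action.apply()
        self.history.append(action)

    def undo_last(self):
        action = self.history.pop()
        action.undo()
        return action.owner, action.description   # audit trail entry

# Usage: fixing a cell while keeping the audit trail.
dataset = {"zip": "8408"}
log = ActionLog()
log.perform(Action("alice", "fix ZIP typo",
                   apply=lambda: dataset.update(zip="84084"),
                   undo=lambda: dataset.update(zip="8408")))
print(dataset)           # {'zip': '84084'}
print(log.undo_last())   # ('alice', 'fix ZIP typo')
print(dataset)           # {'zip': '8408'}
```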
8 Evaluation plan

To assess our approach, we plan to evaluate its scalability by considering data sets of increasing size. Regarding performance, we will consider both the time and the space used. Moreover, we plan to evaluate correctness i) by manually checking the results, ii) by using a data set as a gold standard, iii) by comparing our results with those obtained by other tools, or iv) by evaluating the same metrics on a different data format. For instance, the correctness calculated through the type inference approach described above can be validated against the value calculated on the graph. To verify the usability and applicability of our approach, we plan to involve SPOD users in order to check how it performs in real settings. The research described here is conducted in strict cooperation with our PA and its ICT department. Therefore, they are interested in testing our results and verifying whether they can be practically exploited in their everyday work.

9 Reflections

To the best of our knowledge, quality aspects and privacy concerns are rarely managed simultaneously, both in OD and in LOD. Therefore, our goal is to fill this gap by proposing a framework which helps data providers and consumers in assessing and improving data quality while preventing personal information leakage. Moreover, the features offered by this framework will be integrated into SPOD in order to reach a wider range of users, both to help them provide data of better quality and to test the applicability of our proposal. Since this is my first year of Ph.D., I plan to work on the assessment phase and study how privacy concerns can be managed by the end of this year. I will dedicate the next year to quality improvement, both by studying the most reliable solutions and by developing our own approach. To avoid reinventing the wheel, each step is preceded by a study phase: I plan to reuse the most promising approaches in the literature and to define our own approach to fill any gap. The third, and last, year is dedicated to the evaluation and improvement of the proposed approach based on the collected feedback. The main novelty of our approach is to provide a unique interface to manage both quality and privacy concerns.

Acknowledgement

I would like to thank my supervisor Prof. Vittorio Scarano for his support.

References

1. DAMA International: The DAMA guide to the data management body of knowledge, https://dama.org/content/body-knowledge, last access April 15th, 2019
2. Ballou, D.P., Pazer, H.L.: Modeling data and process quality in multi-input, multi-output information systems. Manage. Sci. 31(2), 150–162 (1985)
3. Berners-Lee, T.: 5-star open data, https://5stardata.info/en/, last access 04-2019
4. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 2787–2795. Curran Associates, Inc. (2013)
5. Cochez, M., Ristoski, P., Ponzetto, S.P., Paulheim, H.: Global RDF vector space embeddings. In: The Semantic Web - 16th International Semantic Web Conference, Proceedings, Part I. pp. 190–207 (2017)
6. Dataguise: DgSECURE (2018), https://www.dataguise.com/detect/, last access 01-2019
7. Donato, R.D., Garofalo, M., Malandrino, D., Pellegrino, M.A., Petta, A., Scarano, V.: Linked data queries by a trialogical learning approach. In: 23rd IEEE International Conference on Computer Supported Cooperative Work in Design (2019)
8. Feeney, K., O'Sullivan, D., Tai, W., Brennan, R.: Improving curated web-data quality with structured harvesting and assessment. International Journal on Semantic Web and Information Systems 10, 35–62 (2014)
9. Ferretti, G., Malandrino, D., Pellegrino, M.A., Pirozzi, D., Renzi, G., Scarano, V.: A non-prescriptive environment to scaffold high quality and privacy-aware production of open data with AI. In: Digital Government Society. Dg.O. (2019)
10. Fleischhacker, D., Paulheim, H., Bryl, V., Völker, J., Bizer, C.: Detecting errors in numerical linked data using cross-checked outlier detection. In: The Semantic Web - ISWC. pp. 357–372 (2014)
11. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. 42(4), 14:1–14:53 (2010)
12. Fürber, C., Hepp, M.: SWIQA - a semantic web information quality assessment framework. In: ECIS Proceedings (2011)
13. Hadhiatma, A.: Improving data quality in the linked open data: a survey. Journal of Physics: Conference Series 978, 12–26 (2018)
14. Heitmann, B., Hermsen, F., Decker, S.: k-RDF-neighbourhood anonymity: Combining structural and attribute-based anonymisation for linked data. In: Proceedings of the 5th Workshop on Society, Privacy and the Semantic Web - Policy and Technology co-located with 16th ISWC (2017)
15. Langer, A., Siegert, V., Göpfert, C., Gaedke, M.: SemQuire - assessing the data quality of linked open data sources based on DQV. In: Current Trends in Web Engineering. pp. 163–175 (2018)
16. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: Privacy beyond k-anonymity and l-diversity. In: IEEE 23rd International Conference on Data Engineering (2007)
17. Liu, K., Terzi, E.: Towards identity anonymization on graphs. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (2008)
18. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy beyond k-anonymity. In: 22nd International Conference on Data Engineering (2006)
19. Mendes, P.N., Mühleisen, H., Bizer, C.: Sieve: Linked data quality assessment and fusion. In: Proceedings of the Joint EDBT/ICDT Workshops. pp. 116–123 (2012)
20. Narayanan, A., Shmatikov, V.: Robust de-anonymization of large sparse datasets. In: 2008 IEEE Symposium on Security and Privacy (SP 2008). pp. 111–125 (2008)
21. Narayanan, A., Shmatikov, V.: Myths and fallacies of "personally identifiable information". Commun. ACM 53(6), 24–26 (2010)
22. European Parliament: General data protection regulation (2018), https://eur-lex.europa.eu/eli/reg/2016/679/oj, last access 04-2019
23. Paulheim, H.: Browsing linked open data with auto complete. In: Semantic Web Challenge (2012)
24. Ristoski, P., Paulheim, H.: RDF2Vec: RDF graph embeddings for data mining. In: The Semantic Web - 15th International Semantic Web Conference, Proceedings, Part I. pp. 498–514 (2016)
25. Simperl, E., Norton, B., Acosta, M., Maleshkova, M., Domingue, J., Mikroyannidis, A., Mulholland, P., Power, R.: Using Linked Data Effectively. The Open University, Milton Keynes (2013)
26. Sleeman, J., Finin, T.: Type prediction for efficient coreference resolution in heterogeneous semantic graphs. In: IEEE Seventh International Conference on Semantic Computing. pp. 78–85 (2013)
27. Sweeney, L.: Simple demographics often identify people uniquely (2000), http://dataprivacylab.org/projects/identifiability/
28. Sweeney, L.: k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. (2002)
29. Wand, Y., Wang, R.Y.: Anchoring data quality dimensions in ontological foundations. Commun. ACM 39, 86–95 (1996)
30. Zaveri, A., Rula, A., Maurino, A., Pietrobon, R., Lehmann, J., Auer, S.: Quality assessment for linked data: A survey. Semantic Web 7(1), 63–93 (2016)
31. Zheleva, E., Getoor, L.: Preserving the privacy of sensitive relationships in graph data. In: Privacy, Security, and Trust in KDD (2008)
32. Zhou, B., Pei, J.: Preserving privacy in social networks against neighborhood attacks. In: Proceedings of the 24th International Conference on Data Engineering, ICDE (2008)
33. Zhou, B., Pei, J., Luk, W.: A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor. Newsl. (2008)