=Paper=
{{Paper
|id=Vol-2699/paper44
|storemode=property
|title=Data Privacy in Journalistic Knowledge Platforms
|pdfUrl=https://ceur-ws.org/Vol-2699/paper44.pdf
|volume=Vol-2699
|authors=Marc Gallofré Ocaña,Tareq Al-Moslmi,Andreas L. Opdahl
|dblpUrl=https://dblp.org/rec/conf/cikm/OcanaAO20
}}
==Data Privacy in Journalistic Knowledge Platforms==
Marc Gallofré Ocaña, Tareq Al-Moslmi and Andreas L. Opdahl

University of Bergen, Fosswinckelsgt. 6, Postboks 7802, 5020 Bergen, Norway

Proceedings of the CIKM 2020 Workshops, October 19-20, Galway, Ireland.

Email: Marc.Gallofre@uib.no (M. Gallofré Ocaña); Tareq.Al-Moslmi@uib.no (T. Al-Moslmi); Andreas.Opdahl@uib.no (A.L. Opdahl)

ORCID: 0000-0001-7637-3303 (M. Gallofré Ocaña); 0000-0002-5296-2709 (T. Al-Moslmi); 0000-0002-3141-1385 (A.L. Opdahl)

© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (http://ceur-ws.org), ISSN 1613-0073.

===Abstract===
Journalistic knowledge platforms (JKPs) leverage data from the news, social media and other sources. They collect large amounts of data and attempt to extract potentially news-relevant information for news production. At the same time, by harvesting and recombining big data, they can challenge data privacy ethically and legally. Knowledge graphs offer new possibilities for representing information in JKPs, but their power also amplifies long-standing privacy concerns. This paper studies the implications of data privacy policies for JKPs. To do so, we have reviewed the GDPR and identified different areas where it potentially conflicts with JKPs.

Keywords: Privacy, Personal data, Journalistic Knowledge Platforms, GDPR

===1. Introduction===
Journalistic Knowledge Platforms (JKPs) are an emerging generation of platforms which combine state-of-the-art artificial intelligence (AI) techniques, like knowledge graphs and natural-language processing (NLP) [1, 2], for transforming newsrooms and leveraging information technologies to increase the quality and lower the cost of news production. JKPs exploit and combine news, social media and other information sources, using linked open data (LOD), digital encyclopaedic sources and news archives to construct knowledge graphs and provide fresh and unexpected information to journalists, helping them dive more deeply into information, events and story-lines [3]. JKPs of various kinds are becoming increasingly important in leading news agencies like BBC [4] and Thomson Reuters [5].

However, obtaining and representing knowledge leads to data privacy concerns when personal data from different sources is neither collected directly from the subject nor with the subject's consent, although some countries have exemptions that loosen privacy requirements for journalistic research that is in the public interest or does not identify individuals directly. This exemption becomes even more complex when the national privacy policies that apply to the data sources and the JKP are distinct or the public interest is not crystal clear.

Data privacy has become a central topic of discussion for organisations and projects from private companies and governments to research activities in universities around the globe. Whereas there is no general solution to privacy for everyone and specific solutions vary between different countries, cultures and organisations, privacy is a common concern, which has been discussed from the ethical and philosophical points of view by many different authors [6, 7] and organisations like the European Commission [8, 9]. The EU has established the General Data Protection Regulation (GDPR), which sets up a framework for governing the usage, processing, privacy and security of personal data, granting individuals power over their data and making organisations responsible for data collection and usage practices.

Our group has been developing News Hunter [10, 11, 12], a series of JKP architectures and prototypes. The current News Hunter platform is big-data ready and designed to continually harvest and monitor real-time news feeds (e.g., RSS or web-sites) and social media (e.g., Twitter and Facebook). It aims to analyse and represent news content semantically in knowledge graphs in order to provide better background information for journalists and to suggest news angles [13, 14, 15, 16, 17].

As part of our News Hunter effort, this paper investigates the implications of the GDPR on JKPs. To do so, we asked ourselves which data privacy conflicts can arise when JKPs are used in journalistic work, in particular when that work may be exempted from some privacy regulations because it is in the public interest. To the best of our knowledge, there is no previous work discussing the possible data privacy conflicts in JKPs. Our contributions are: (1) we review different journalistic scenarios and personal data sources that can conflict with GDPR policies, and (2) we introduce a personal data matrix framework to classify personal data conflicts and discuss the possible uses of this matrix.

This paper is organised as follows: section 2 defines the main privacy concepts, section 3 discusses potential data privacy conflicts in JKPs, section 4 introduces the personal data matrix framework, section 5 summarises the conclusions, and section 6 presents open questions and future work.

===2. Background===
====2.1. Journalistic knowledge platforms====
Journalistic Knowledge Platforms (JKPs) leverage and combine news, multimedia content (e.g., TV news channels and podcasts), social media (e.g., Twitter and Facebook), web-blogs and information over the net, using linked open data (LOD), digital encyclopaedic sources (e.g., Wikipedia and Wikidata) and news archives to provide fresh and unexpected information to journalists. Projects like Neptuno [18], Event Registry [19], NEWS [20], NewsReader [21], SUMMA [22] and News Angler [11, 16, 12] have presented examples of JKPs.

A typical JKP comprises a knowledge graph [23, 24, 25] along with AI, NLP pipelines, and semantic technology components. In a JKP, the knowledge graph is filled with potentially news-related histories, information and current and archival news to support journalists in creating newsworthy stories, finding relevant information, events and story-lines, and validating and verifying news. The information in the knowledge graph is represented using standard identifiers and semantic knowledge representations with reasoning capabilities. The usage of standard identifiers facilitates data integration, which is the process of joining and merging different data sets or public data sources like the Semantic Web [26, 27], linked open data (LOD) [28], including Wikidata and DBpedia, and Wikipedia. Data integration together with reasoning makes it possible to draw insights from across the data that could not be obtained from isolated datasets. This inherent ability to draw new insights implies that new personal data may be derived and exposed in the knowledge graph.

====2.2. Privacy====
Privacy is a historically and culturally situated concept. For example, whereas privacy in Europe is traditionally considered as an inalienable basic right of an autonomous person that states must protect to preserve a democratic society, the concept of privacy in the United States of America is understood primarily as a physical notion that implies the "private space" (e.g., bedroom, bathroom or the entire home) [6].

These differences are reflected in the data privacy regulations of the EU and the USA. The EU states in the General Data Protection Regulation (GDPR) [29] that individuals must be notified and have the right to consent when their personal data is collected from either inside or outside EU legislation. In contrast, the US only regulates privacy issues regarding health matters and some financial information, leaving the rest to individual states or to businesses, which do not need to ask for individuals' consent and instead give individuals the possibility to opt out if they have any reservations about what is being collected from them. On a global scale beyond the EU and the USA, differences in how privacy is viewed are even bigger, making it even more challenging to handle privacy regulations when JKPs are used in fully international news organisations that operate across cultural and legal domains.

====2.3. GDPR====
All actions using or processing personal data of data subjects who are in the European Union shall obey the General Data Protection Regulation (GDPR) [29]. The GDPR is an extensive regulation which sets the basis for dealing with personal data in the EU or using personal data from the EU. This section highlights the most general concepts that restrict what and how to process personal data in JKPs.

The GDPR defines the concepts of personal data and processing (Chapter I, Article 4). Personal data is any information that can be employed to identify directly or indirectly a natural person (e.g., name, an identification number or online identifier) or sensitive data like public health, biometric, genetic, economic, cultural factors or political opinions of a natural person. Data that has been de-identified, encrypted or pseudonymised but can be used to re-identify a person is considered personal data too. By processing, the GDPR means operations such as collection, structuring, storage, alteration, consultation, use, disclosure, combination, restriction, erasure or destruction of personal data.

Moreover, the GDPR establishes a set of principles for processing personal data (Chapter II) which define how data have to be processed, stored and maintained. This set of principles establishes that data shall be processed only for the initial purposes and purposes compatible with them (purpose limitation), that only what is necessary for the purpose shall be processed (data minimisation), that personal data shall be accurate and kept up to date (accuracy), and that data shall be stored for no longer than necessary for the purpose (storage limitation). It also defines the lawfulness of processing, which determines when personal data can be processed, e.g., when the data subject gives consent or for a task carried out in the public interest. Under the GDPR, some research by journalists and academics is understood as being in the public interest. Likewise, the GDPR limits the processing of sensitive data, which is prohibited in general terms but with some exceptions, e.g., when the data subject gives explicit consent, when it is necessary for reasons of substantial public interest, or when the data subject has manifestly made the data public.

The GDPR also details when and which information has to be provided to the data subjects (Chapter III). In the case of personal data that is not obtained directly from the subject, it determines which data have to be provided, e.g., the source of the personal data and whether it came from publicly accessible sources, or the categories of personal data. Nevertheless, it also establishes some exemptions, e.g., when the provision of such information proves to be impossible or is likely to harm the objectives of the processing.

===3. Privacy conflicts in JKPs===
When discussing which scenarios in JKPs can cause a conflict with the GDPR, we must consider the source of the personal data, distinguishing between the data gathered directly from the subject, the data harvested from other sources like news or social media, and the inferred data.

In the context of the GDPR, some data processing by journalists is exempted when it is conducted in the public interest. However, this exemption exclusively applies to journalistically relevant (newsworthy) personal data, not to any personal data processed in the JKP, and sensitive data may be less exempted or not exempted at all. Therefore, we must also consider how relevant the personal data is for the public interest from a news perspective. This includes the assessment of newsworthiness [30] along with the type of news. E.g., a corruption scandal and a private event in the life of a famous person may both be highly newsworthy, but corruption is most likely more important for the public interest.

====3.1. Personal data from the subject====
When personal data comes directly from a subject and is collected with the subject's explicit consent (e.g., personal data collected during an interview), it does not present a problem with the GDPR. However, the data can also be made publicly accessible by the subjects themselves in social media networks (e.g., posts like tweets in Twitter or forums and groups like Facebook groups) or in the subject's verifiable social media accounts and personal web sites without explicit consent being given to the JKP for its collection. In that case, apart from having to follow the source's data policies, this raises ethical questions: whether consent is implicit because the data is publicly available, and when and under which conditions data should be considered publicly available.

====3.2. Personal data from third parties====
When the data is not collected directly from the subject, but has instead been made accessible by a third party, and the subject may be unaware of its existence, we have to consider two possible scenarios.

In the first scenario, news-related information is gathered from the web (e.g., online news, RSS, web-sites or social media), and JKPs can extract personal data from the content to represent and combine it in the knowledge graphs. E.g., from "We know the classic 7-layer dip, made with Bush's Beans, is a fan favorite for game day snacking celebrations, Kate Rafferty, the consumer experience manager for Bush's, told Fox News." [Footnote 1], we can extract information like "Kate Rafferty is a person who works as consumer experience manager at Bush's Beans company at Knoxville, Tennessee", which can be considered personal data, as it can be used to identify a natural person. According to the GDPR, the subject should be notified, and the JKP has to provide a mechanism for the subject to protest. However, on a large scale, two issues arise: the number of notifications that famous people will get, and how to contact subjects if the contact information is missing.

Footnote 1: Drew Schwartz, VICE: This 70-Layer Bean Dip Is the Most Vile Thing I've Ever Seen (https://t.co/qKyyNevpBh)

In the second scenario, personal data is gathered from publicly available sources or open sources like Wikipedia, Wikidata or telephone/address books, so it is clear that the personal data is already public. However, it may not have been released with the subject's consent. This raises the question of why the JKP should not be allowed to store copies of personal data which is already public.

====3.3. Inferred personal data====
In this case, personal data is not gathered from any source, but is instead inferred by applying reasoning techniques to the existing data (whether obtained from the data subject, collected from news or gathered from public sources). E.g., from the text "The European Court of Justice (ECJ) said that Oriol Junqueras had become an MEP the moment he was elected in May, despite being on trial for sedition." [Footnote 2], we can represent the person "Oriol Junqueras" as the entity Q116812 from Wikidata, from which we can derive that he is a member of a political party (P102) and that the political party is "the Republican Left of Catalonia" (Q150068). With this information it is possible to infer the subject's political ideology (P1142) from the political party information, such as "republicanism" (Q877848) and "Catalan pro-independence movement" (Q893331). In this scenario there is no direct source of the subject's personal data or political opinions; instead, there is a source of related information used for inferring knowledge, which can be either in the same knowledge graph or in external sources.

Footnote 2: BBC: Jailed Catalan leader 'should have had immunity', rules EU court (https://www.bbc.com/news/world-europe-50808766)

====3.4. Possible solutions====
To comply with the GDPR's Chapter III, in any of the previously discussed scenarios it is important to identify the data source and personal data category (e.g., name, ID number, online identifier, health data, political opinion). Thus, it will be possible to identify both the source and the data and take actions accordingly. Although the primary responsibility for complying with the GDPR lies with the data provider (i.e., news website, social media platform or telephone/address books), JKPs should follow the GDPR to safeguard the subjects' privacy and consider the policies and restrictions established by the data provider. The JKP must always take independent responsibility for privacy, and it cannot trust its sources to safeguard privacy. In a truly international and global set-up, where different privacy policies apply, JKPs may have to be designed with different knowledge graphs for different legal domains or geographical regions, each graph only being accessible from its own privacy domain. When this is infeasible, the most restrictive policies to guarantee personal data privacy must be adopted.

Moreover, JKPs should also implement automatic mechanisms to notify subjects of both the personal data and the sources when this information is identified, a process that can be done by email. It is also possible to set up an automatic system for subjects to protest, complain, request or ask about personal data.

===4. Personal data matrix===
After reviewing the previous scenarios, we classified the different situations that can cause a conflict with the GDPR into a two-dimensional matrix framework (Table 1). The personal data matrix aims to help journalists and JKP developers to classify the personal data in JKPs and its possible issues with privacy policies.

{| class="wikitable"
|+ Table 1: Personal data matrix
! !! Consented !! Collected !! Inferred
|-
! Impersonal data
| ✔
| ✔
| ✔
|-
! Personal data
| ✔
| !
| !
|-
! Sensitive data
| ✔
| !!
| !!
|}

The personal data matrix (Table 1) classifies personal data based on the privacy level and the data source. The first dimension (privacy level) classifies the data according to whether it does not represent personal data (impersonal data), represents personal data, or represents "sensitive data". There is an explicit distinction between personal and "sensitive data" because in the GDPR "sensitive data" is subject to much stricter limitations. The other dimension, the data source dimension, distinguishes between data collected with the subject's consent (consented), data directly collected from the content (collected) and inferred data (inferred). The treatment of data is straightforward only when the data is either explicitly consented or not personal. Otherwise, as discussed in the previous section, each of these combinations has its issues and open questions regarding the application of the GDPR and its origin.

The data matrix can also be regarded as a cube, where the public interest represents the third dimension. This third dimension determines to what extent the GDPR exemption to data processing for journalistic purposes in the public interest applies, taking into consideration the newsworthiness component of the data.

The proposed matrix can be used by JKP researchers and developers to ensure that privacy regulations are never violated, as automatically as possible but in practice aided by human data privacy stewards. The matrix should be used in the design of JKPs to ensure that personal data is protected by default. E.g., developers of JKPs can use the matrix to evaluate the system and identify which processes or collected data can lead to privacy conflicts; the matrix can be implemented as part of the news creation workflow so that journalists can automatically check data privacy compliance before collecting, re-combining or using any personal data; it can be used as metadata for each piece of data in the knowledge graph to automate its recognition and privacy assurance; and it can be used when dealing with data under different regulations to find divergences between them.

===5. Conclusion===
JKPs need to deal with personal data which in many cases will be integrated into knowledge graphs without the explicit consent of the subject. Thus, JKPs need to safeguard data privacy. For that reason, we have presented a framework for classifying personal data in journalistic knowledge graphs and identified different scenarios and personal data sources that can potentially conflict with the GDPR. We believe the identified scenarios, sources and presented matrix will be helpful as a reference for related projects and similar domains.

===6. Future work===
We want to continue exploring the open questions highlighted in our discussions in section 3, as well as questions such as how to deal with different privacy regulations that may apply in international settings, how to represent and effectively use the GDPR in JKP processes, and how to deal with personal data about children. Data linking transparency is another open question; addressing it would help to identify situations that conflict with privacy and to identify which data can and cannot be stored in JKPs according to the GDPR and other privacy policies and regulations. Besides that, as anonymisation, encryption and blockchain technologies are presented as potential solutions to safeguard privacy and control copyrights and data access, we want to research how effective they are in the context of JKPs and how they can benefit JKPs.

Apart from that, one critical aspect when dealing with data from external sources, which has not been considered in this work, is copyright and intellectual property regulation, which has a direct bearing on the data that can be processed and stored. In this context, we want to explore how to manage such regulations effectively in JKPs (e.g., using ontologies).

===Acknowledgments===
Supported by the Norwegian Research Council IKTPLUSS project 275872 News Angler, which is a collaboration with Wolftech AB, Bergen, Norway.

===References===
[1] T. Al-Moslmi, M. Gallofré Ocaña, Lifting news into a journalistic knowledge platform, in: Proceedings of the CIKM 2020 Workshops, Galway, Ireland, 2020. To appear.

[2] T. Al-Moslmi, M. Gallofré Ocaña, A. L. Opdahl, C. Veres, Named entity extraction for knowledge graphs: A literature overview, IEEE Access 8 (2020) 32862–32881.

[3] M. Gallofré Ocaña, A. L. Opdahl, Challenges and opportunities for journalistic knowledge platforms, in: Proceedings of the CIKM 2020 Workshops, Galway, Ireland, 2020. To appear.

[4] Y. Raimond, M. Smethurst, A. McParland, C. Lowis, Using the past to explain the present: Interlinking current affairs with archives via the semantic web, in: H. Alani, L. Kagal, A. Fokoue, P. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. Noy, C. Welty, K. Janowicz (Eds.), The Semantic Web – ISWC 2013, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 146–161. doi:10.1007/978-3-642-41338-4_10.

[5] B. Ulicny, Constructing knowledge graphs with trust, in: METHOD 2015: The 4th International Workshop on Methods for Establishing Trust of (Open) Data, Bethlehem, PA, 2015.

[6] C. Ess, Digital Media Ethics, Polity, 2014.

[7] D. G. Johnson, Computer Ethics, Prentice Hall, 2001.

[8] European Parliament, Regulation (EU) no 1291/2013 of the european parliament and of the council of 11 december 2013 establishing horizon 2020 - the framework programme for research and innovation (2014-2020) and repealing decision no 1982/2006/EC. Text with EEA relevance (2013).

[9] European Commission, Policy | science with and for society - research and innovation - european commission, 2019. URL: http://ec.europa.eu/research/swafs/index.cfm?pg=policy&lib=ethics.

[10] A. Berven, O. Christensen, S. Moldeklev, A. Opdahl, K. Villanger, News hunter: building and mining knowledge graphs for newsroom systems, in: NOKOBIT - Norsk konferanse for organisasjoners bruk av informasjonsteknologi, volume 26, 2018.

[11] M. Gallofré Ocaña, L. Nyre, A. L. Opdahl, B. Tessem, C. Trattner, C. Veres, Towards a big data platform for news angles, in: 4th Norwegian Big Data Symposium (NOBIDS) 2018, 2018.

[12] A. Berven, O. Christensen, S. Moldeklev, A. Opdahl, K. Villanger, A knowledge graph platform for newsrooms, Computers in Industry (2020). To appear.

[13] B. Tessem, A. L. Opdahl, Supporting journalistic news angles with models and analogies, in: 2019 13th International Conference on Research Challenges in Information Science (RCIS), IEEE, 2019, pp. 1–7. doi:10.1109/RCIS.2019.8877058.

[14] A. L. Opdahl, B. Tessem, Towards ontological support for journalistic angles, in: Enterprise, Business-Process and Information Systems Modeling, Springer International Publishing, 2019, pp. 279–294. doi:10.1007/978-3-030-20618-5_19.

[15] B. Tessem, Analogical news angles from text similarity, in: Artificial Intelligence XXXVI, Springer International Publishing, 2019, pp. 449–455. doi:10.1007/978-3-030-34885-4_35.

[16] A. L. Opdahl, B. Tessem, Ontologies for finding journalistic angles, Software and Systems Modeling (2020) 1–17.

[17] E. Motta, E. Daga, A. L. Opdahl, B. Tessem, Analysis and design of computational news angles, IEEE Access (2020).

[18] P. Castells, F. Perdrix, E. Pulido, R. Mariano, R. Benjamins, J. Contreras, J. Lorés, Neptuno: Semantic web technologies for a digital newspaper archive, in: European Semantic Web Symposium, Springer, Berlin, Heidelberg, 2004, pp. 445–458.

[19] G. Leban, B. Fortuna, J. Brank, M. Grobelnik, Event registry: Learning about world events from news, in: Proceedings of the 23rd International Conference on World Wide Web, WWW '14 Companion, ACM, New York, NY, USA, 2014, pp. 107–110. doi:10.1145/2567948.2577024.

[20] N. Fernández, J. M. Blázquez, J. A. Fisteus, L. Sánchez, M. Sintek, A. Bernardi, M. Fuentes, A. Marrara, Z. Ben-Asher, News: Bringing semantic web technologies into news agencies, in: I. Cruz, S. Decker, D. Allemang, C. Preist, D. Schwabe, P. Mika, M. Uschold, L. M. Aroyo (Eds.), The Semantic Web - ISWC 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 778–791.

[21] P. Vossen, R. Agerri, I. Aldabe, A. Cybulska, M. van Erp, A. Fokkens, E. Laparra, A.-L. Minard, A. Palmero Aprosio, G. Rigau, M. Rospocher, R. Segers, NewsReader: Using knowledge resources in a cross-lingual reading machine to generate more knowledge from massive streams of news, Knowledge-Based Systems 110 (2016). doi:10.1016/j.knosys.2016.07.013.

[22] U. Germann, P. v. d. Kreeft, G. Barzdins, A. Birch, The summa platform: Scalable understanding of multilingual media, in: Proceedings of the 21st Annual Conference of the European Association for Machine Translation, 2018.

[23] A. Singhal, Introducing the knowledge graph: things, not strings, 2012. URL: https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.

[24] L. Ehrlinger, W. Wöß, Towards a definition of knowledge graphs, in: Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change & Evolving Semantics (SuCCESS'16) co-located with the 12th International Conference on Semantic Systems (SEMANTiCS 2016), 2016, p. 4. URL: http://ceur-ws.org/Vol-1695/paper4.pdf.

[25] J. Yan, C. Wang, W. Cheng, M. Gao, A. Zhou, A retrospective of knowledge graphs, Frontiers of Computer Science (2016). doi:10.1007/s11704-016-5228-9.

[26] T. Berners-Lee, J. Hendler, O. Lassila, et al., The semantic web, Scientific American 284 (2001).

[27] N. Shadbolt, T. Berners-Lee, W. Hall, The semantic web revisited, IEEE Intell. Syst. 21 (2006) 96–101.

[28] C. Bizer, T. Heath, T. Berners-Lee, Linked data: The story so far, in: Semantic services, interoperability and web applications: emerging concepts, IGI Global, 2011, pp. 205–227.

[29] The European Parliament and The Council of the European Union, Regulation (eu) 2016/679 of the european parliament and of the council of 27 april 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (general data protection regulation) (text with eea relevance), Official Journal of the European Union (2016). URL: http://data.europa.eu/eli/reg/2016/679/oj.

[30] T. A. A. Al-Moslmi, M. Gallofré Ocaña, A. L. Opdahl, B. Tessem, Detecting newsworthy events in a journalistic platform, in: The 3rd European Data and Computational Journalism Conference, 2019, pp. 3–5.
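
As an illustration of how the personal data matrix (Table 1) could be attached to data in a knowledge graph, the following is a minimal sketch and not the authors' implementation. It encodes the two dimensions of the matrix as enumerations and replays the inference example from section 3.3 on a small in-memory set of triples. The Wikidata identifiers are the ones quoted in the paper; the names `classify`, `infer_ideologies`, `Flag.REVIEW` and `Flag.STRICT_REVIEW` are hypothetical, and reading "!" and "!!" as two escalating review levels is an assumption, since the paper does not define the symbols further.

```python
from enum import Enum


class PrivacyLevel(Enum):
    IMPERSONAL = "impersonal"
    PERSONAL = "personal"
    SENSITIVE = "sensitive"


class DataSource(Enum):
    CONSENTED = "consented"
    COLLECTED = "collected"
    INFERRED = "inferred"


class Flag(Enum):
    # Symbols as printed in Table 1; the member names are an assumed reading.
    OK = "✔"
    REVIEW = "!"
    STRICT_REVIEW = "!!"


def classify(level: PrivacyLevel, source: DataSource) -> Flag:
    """Return the Table 1 cell for a given privacy level and data source."""
    # Impersonal data and consented data are marked as unproblematic.
    if level is PrivacyLevel.IMPERSONAL or source is DataSource.CONSENTED:
        return Flag.OK
    # Sensitive data that was collected or inferred carries the stronger mark.
    if level is PrivacyLevel.SENSITIVE:
        return Flag.STRICT_REVIEW
    return Flag.REVIEW


# Toy triples copied from the example in section 3.3 (not a live Wikidata query).
TRIPLES = {
    ("Q116812", "P102", "Q150068"),   # Oriol Junqueras -> member of political party
    ("Q150068", "P1142", "Q877848"),  # party -> political ideology: republicanism
    ("Q150068", "P1142", "Q893331"),  # party -> Catalan pro-independence movement
}


def infer_ideologies(person: str) -> set:
    """Two-hop inference: person -> party (P102) -> ideology (P1142)."""
    parties = {o for s, p, o in TRIPLES if s == person and p == "P102"}
    return {o for s, p, o in TRIPLES if s in parties and p == "P1142"}


ideologies = infer_ideologies("Q116812")
# Political opinions are sensitive data, and here they were inferred.
flag = classify(PrivacyLevel.SENSITIVE, DataSource.INFERRED)
print(sorted(ideologies), flag.value)
```

In this sketch the flag would be stored next to each derived statement, so that a journalist or a data privacy steward can see that the inferred ideology falls in the most restricted cell of the matrix before it is used. The public-interest dimension of the cube is deliberately left out, because the paper does not specify how it would be measured.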