Data Mashups Privacy Preservation for Learning Analytics

Mercedes Rodríguez García 1, Antonio Balderas 2 and Juan Manuel Dodero 2

1 Departamento de Ingeniería en Automática, Electrónica, Arquitectura y Redes de Computadores, Universidad de Cádiz, 11519 Puerto Real, Spain
2 Departamento de Ingeniería Informática, Universidad de Cádiz, 11519 Puerto Real, Spain

Learning Analytics Summer Institute Spain (LASI Spain) 2022, June 20–21, 2022, Salamanca, Spain
EMAIL: mercedes.rodriguez@uca.es (A. 1); antonio.balderas@uca.es (A. 2); juanma.dodero@uca.es (A. 3)

Abstract
The diversity of information sources available to educational institutions makes it necessary to mash up information in order to get the most out of learning analytics. Data mashups require data anonymisation methods that protect the privacy of the learners who appear in the data partitions. However, anonymising the mashed-up data can lead to a loss of data utility. This paper presents a data mashup protocol that preserves privacy by k-anonymising the data while retaining its analytical utility.

Keywords: Learning Analytics, Data Mashup, Data Privacy, k-Anonymity

1. Introduction

Today, large datasets about students' activity are available to educational institutions from a variety of sources [16]. These datasets collect important data on student performance and learning, but also contain demographic data. To integrate and compile information from all these sources, current e-learning environments rely on data mashups, which offer a broader view of the learner through the exploitation of Learning Analytics (LA) [28].

The confidence of the education community is fundamental to the adoption of LA-based tools [13]. Mashing up information with personal content from a variety of sources is not well received, as it may compromise individuals' privacy. Even if unique identifiers are removed, correlation through potentially identifying attributes (quasi-identifiers) can still lead to the re-identification of individuals [24]. A protocol that anonymises learning data while guaranteeing its usefulness is therefore essential. This paper presents a protocol to mash up data and then anonymise it without losing its statistical usefulness, and applies it to a dataset of higher education students [20].

2. Background

Data privacy is one of the biggest challenges in LA research [3]. The solutions found in this context are based on preventing access to data by people who should not have it, either by defining role-based data access [4] or by storing information locally and avoiding cloud solutions [2]. Applying LA is essential if practitioners want a snapshot of their students' learning process, since LA is fundamental to exploit the large amount of information produced by the learners' work in the different virtual learning environments [17]. Thus, a mashup of the datasets contained in the partitions from different providers has to be performed while guaranteeing the learners' privacy.

The datasets provided can come from either horizontal or vertical partitioning. In horizontal partitions, different datasets follow the same schema but store different users [15, 21]; in vertical partitions, different datasets store different sets of attributes of the same users, identified by a common attribute [6].
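As a minimal illustration of the two configurations, the following Python sketch uses made-up records and attribute names (they are not taken from the paper's dataset):

```python
# Horizontal partitions: the same schema, but different students per provider.
campus_a = [{"student_id": "s1", "age": 21, "grade": 7.5}]
campus_b = [{"student_id": "s2", "age": 23, "grade": 6.0}]

# Vertical partitions: different attributes of the same students, linked
# through a common identifier attribute (student_id).
lms_demographics = [{"student_id": "s1", "age": 21, "gender": "F"}]
lrs_assessments = [{"student_id": "s1", "activity": "act-01", "result": "failed"}]

# A naive vertical join on the shared identifier. Publishing such a join as-is
# would expose quasi-identifiers next to confidential data; the protocol in
# Section 3 replaces this direct join with pseudonymised connectors and
# k-anonymisation.
joined = [
    {**d, **a}
    for d in lms_demographics
    for a in lrs_assessments
    if d["student_id"] == a["student_id"]
]
print(joined)
```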
Vertical partitioning is the typical configuration of the datasets used to build the next generation of Virtual Learning Environments (VLEs). Databases to store and query e-learning data can be implemented with different storage techniques, including graph databases [22], e.g. RDF (Resource Description Framework) triple stores, and relational databases [1].

Techniques used in previous work on vertically partitioned datasets achieve anonymisation by k-generalising the dataset [15, 21, 19, 9]. Generalisation techniques have the disadvantage that they either incur a high computational cost to find an optimal generalisation that minimises information loss [18], or they require an ad hoc taxonomic binary tree for each attribute to be anonymised [8]. It would therefore be desirable to incorporate more practical k-anonymisation techniques, such as those based on microaggregation, into vertical data mashups.

With respect to LA, the way in which learner data is represented in VLEs is critical to the performance of LA methods [26]. One of the main goals of the FAIR (Findability, Accessibility, Interoperability, and Reusability) [27] and open data principles is to improve data representation by enriching metadata with multiple attributes. However, intelligent computing techniques such as machine learning raise ethical and security issues that may be at odds with these principles [23]. Hence, when applied to the field of technology-enhanced learning, FAIR and open data principles can be an advantage for the support of human learning, as well as a risk to human privacy.

The application of Privacy-by-Design (PbD) techniques is crucial for LA research and analytics in educational institutions. Given that current VLEs rely on data from cloud-based environments [16, 5], LA requires enhanced Privacy-Preserving Data Publishing (PPDP) methods capable of operating on data mashups, so that privacy constraints do not impose a limitation on LA solutions [10]. This research aims to address the fact that the PPDP solutions used for LA [14, 11] have not taken into account the actual mashup structure of current VLEs. For the sake of privacy-driven learning analytics, the PPDP techniques considered here are limited to k-anonymity, since others, such as differential privacy, have been shown to provide a worse balance between privacy and utility [12]. The limits and misuse of differential privacy regarding data publishing, which is the main purpose of this research, have been reported previously [7].

3. Privacy-Preserving Data Mashup Protocol

In this section we present a protocol to mash up vertical data partitions from different data providers. The protocol consists of two phases: the setup protocol, and the anonymisation and integration protocol. In the first phase, the mashup coordinator identifies the data providers that can supply the data partitions to be used by the data consumer. In the second phase, the data providers and the mashup coordinator anonymise and vertically integrate the data partitions to obtain the de-identified dataset.

We assume that the vertical data partitions contain three types of attributes: identifying attributes; quasi-identifying attributes, whose combinations may be identifying if cross-referenced with other sources of information; and confidential attributes.
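A rough sketch of how a provider could declare this three-way classification of its partition is shown below; the dataclass and the concrete attribute names are illustrative assumptions (they anticipate the example in Section 3.1), not part of the protocol specification.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PartitionSchema:
    """Schema a data provider could publish to the mashup coordinator
    (hypothetical structure; the protocol only requires that the three
    attribute types are declared)."""
    provider: str
    identifier: str                      # identifying attribute, used to join partitions
    quasi_identifiers: List[str] = field(default_factory=list)  # identifying if cross-referenced
    confidential: List[str] = field(default_factory=list)       # sensitive attributes

# Illustrative declarations for a demographic and an activity partition.
demographics = PartitionSchema(
    provider="LMS demographics",
    identifier="student_id",
    quasi_identifiers=["gender", "age", "disability"],
)
activity = PartitionSchema(
    provider="LRS activity results",
    identifier="student_id",
    confidential=["result"],
)
```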
3.1. Setup Protocol

As shown in Figure 1, the mashup coordinator is responsible for initiating the setup protocol as soon as it receives a request from a data consumer. The mashup coordinator's tasks include the following:

1. Identification of the providers that may hold the information required by the request. Providers publish their data schema, indicating their identifying, quasi-identifying and confidential attributes.
2. Construction of the final mashup schema. This schema should include the identifier attribute that will be used for the join of the data partitions, the aggregate quasi-identifiers, the privacy level that will be applied to the aggregate quasi-identifiers, and the set of confidential attributes.
3. Designation of the leading provider that will initiate the anonymisation and integration protocol.

The following example demonstrates how the setup protocol is implemented. We assume that the coordinator has received a request for information about the evaluations of a set of students along with their demographic data, and starts the setup protocol.

First, the mashup coordinator identifies potential data providers. In this example the mashup coordinator considers two providers:

• Provider 1 (P1): student demographic data that comes from an LMS database table (Figure 2, left side).
• Provider 2 (P2): a Learning Record Store (LRS) containing the assessments of a set of students in an activity (Figure 2, right side).

Figure 1: Setup protocol.
Figure 2: List of demographic data obtained from the LMS database (left side) and record of a student in xAPI who has failed the activity (right side). The student in the xAPI record is linked to their database record.

Second, the mashup coordinator builds the data mashup schema. In this example, the coordinator uses the RDF view strategy described in [25] and defines the mashup namespace (mup) to map the linked data attributes of the aforementioned schemas, since linked data vocabularies, e.g. foaf and schema.org, might not be easily found or mapped to the attributes of the providers. Each tuple t in P1.demographic produces the following RDF triple:

mup:student#t.student_id rdf:type foaf:Person

For each tuple t in P1.demographic and each local QI attribute identifiable as such in P1, one RDF triple is generated. For each local QI attribute, the protocol follows this strategy:

• If a standard vocabulary exists to represent the attribute, it is mapped to that vocabulary. For instance, gender.
• If no such vocabulary exists, the attribute is defined directly in the mashup namespace (mup). For instance, disability.

mup:student#t.student_id schema:gender mup:student#t.gender
mup:student#t.student_id mup:disability mup:student#t.disability
mup:student#t.student_id mup:age mup:student#t.age

The mashup coordinator can also use foaf:age as a valid mapping instead of using mup:age directly, by adding the following triple:

foaf:age owl:sameAs mup:age

For each tuple t in P1.demographic and u in P2.activity such that t.student_id = u.student_id, a triple of the following structure is generated:

mup:student#t.student_id mup:failed mup:student#u.activity_id

A sketch of how this mapping could be materialised is given at the end of this subsection. Third, the mashup coordinator chooses the leading provider so that the latter can initiate the integration and anonymisation protocol.
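Returning to the second step of the example, the following is a minimal sketch of how the coordinator's RDF view could be materialised. It assumes the rdflib Python library, an illustrative URI for the mup namespace (the paper only fixes the prefix), and a hypothetical demographic_triples helper; attribute values are modelled as mup resources to mirror the triples listed above.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import FOAF, OWL, RDF

# Assumed namespace URIs; the paper only fixes the prefixes (mup, schema, foaf).
MUP = Namespace("http://example.org/mup/")
SCHEMA = Namespace("https://schema.org/")

def demographic_triples(g, t):
    """Map one tuple t of P1.demographic to the mashup RDF view."""
    s = MUP[f"student#{t['student_id']}"]
    g.add((s, RDF.type, FOAF.Person))
    g.add((s, SCHEMA.gender, MUP[f"student#{t['gender']}"]))       # standard vocabulary exists
    g.add((s, MUP.disability, MUP[f"student#{t['disability']}"]))  # defined in the mup namespace
    g.add((s, MUP.age, MUP[f"student#{t['age']}"]))
    return s

g = Graph()
g.bind("mup", MUP)
g.bind("foaf", FOAF)
g.bind("schema", SCHEMA)
g.add((FOAF.age, OWL.sameAs, MUP.age))  # optional mapping of mup:age to foaf:age

s = demographic_triples(g, {"student_id": "s1", "gender": "F", "disability": "none", "age": 21})
# Link the demographic tuple to a failed activity of P2.activity that has the
# same student_id (mirrors the mup:failed triple in the text).
g.add((s, MUP.failed, MUP["student#act-01"]))

print(g.serialize(format="turtle"))
```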
3.2. Anonymisation and Integration Protocol

This protocol carries out the vertical integration of the data partitions identified in the setup protocol and the k-anonymisation of the aggregate quasi-identifier, which is built by vertically joining the quasi-identifier attributes of each partition.

Privacy-preserving data collection and integration is achieved by decoupling the collection of quasi-identifiers from the collection of confidential data and by using what are known as privacy-preserving connectors (ppc) [20], i.e. pseudonyms of the identifier attribute shared by all the vertical partitions. The ppc for a given record is computed as a collision-resistant hash of the value that the identifier attribute holds in the record and a nonce common to all records. The nonce, a one-time arbitrary number, is used to prevent reuse of the connector and to strengthen it against dictionary attacks (a sketch of this derivation is given at the end of this subsection). Two ppc are used in the protocol: one to integrate the data partitions received in the quasi-identifier collection, named Qppc, and another to integrate the data partitions received in the confidential data collection, named Cppc. This segregated collection of attributes contributes to anonymising the data because it allows confidential attributes to be disassociated from quasi-identifiers and thus prevents the mashup coordinator from linking the original values of the quasi-identifiers with sensitive information.

The anonymisation and integration protocol is summarised as follows:

1. The leading provider generates the nonces Qnonce and Cnonce used to build the privacy-preserving connectors.
2. The leading provider shares the nonces with the other data providers participating in the process over a secure channel between the communicating parties, such as TLS (Transport Layer Security).
3. Each provider derives the connectors Qppc and Cppc for each of the records in its partition.
4. Each provider sends the quasi-identifier attributes of its partition, along with the corresponding Qppc connectors, to the mashup coordinator via a secure channel.
5. The mashup coordinator vertically integrates the received data partitions through the connector Qppc to build the aggregate quasi-identifier.
6. The mashup coordinator initiates the anonymisation process of the aggregate quasi-identifier. Any PPDP method that satisfies k-anonymity, such as those based on microaggregation or generalisation mentioned in Section 2, can be used to anonymise the quasi-identifier attributes.
7. The mashup coordinator sends the anonymised aggregate quasi-identifier to each data provider. Because the anonymisation of the quasi-identifiers has been delegated to the mashup coordinator, the data providers must check, before reporting confidential information, that the result satisfies the requirements of k-anonymity.
8. Each provider integrates the anonymised aggregate quasi-identifier with its confidential data through the connector Qppc.
9. Each provider sends its confidential data, along with the Cppc connectors and the anonymised aggregate quasi-identifier, to the mashup coordinator via a secure channel.
10. The mashup coordinator vertically integrates the received data partitions through the connector Cppc to yield the de-identified dataset provided to the data consumer. This dataset satisfies k-anonymity because at least k records share the same values in the aggregate quasi-identifier.
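As an illustration of how the connectors in steps 1–3 could be derived, the following sketch uses HMAC-SHA-256 as the collision-resistant hash keyed with the shared nonce; the function names, parameters and example record are illustrative assumptions, not part of the protocol specification.

```python
import hashlib
import hmac
import secrets

def make_nonce(nbytes: int = 32) -> bytes:
    """Step 1: the leading provider generates a fresh one-time nonce."""
    return secrets.token_bytes(nbytes)

def ppc(identifier: str, nonce: bytes) -> str:
    """Derive a privacy-preserving connector for one record: a keyed,
    collision-resistant hash of the identifier value and the shared nonce."""
    return hmac.new(nonce, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Steps 1-3: the nonces are generated once, shared with the other providers
# over TLS, and each provider derives both connectors for every record.
q_nonce, c_nonce = make_nonce(), make_nonce()
record = {"student_id": "s1", "age": 21, "gender": "F"}
q_ppc = ppc(record["student_id"], q_nonce)  # joins quasi-identifier partitions (steps 4-5)
c_ppc = ppc(record["student_id"], c_nonce)  # joins confidential-data partitions (steps 9-10)
```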
4. Conclusions

This contribution has presented a new PPVD (Privacy-Preserving Vertical Data) protocol with the following features:

• It serves requests for learning datasets from data consumers.
• It identifies learning data sources, i.e. the different data providers that can satisfy a particular information request.
• It vertically integrates learning data from different educational sources without revealing the identities of the learners referenced in the data.
• Finally, it provides the resulting k-anonymised dataset to the data consumer.

The protocol provides an effective integration of learning data and a PbD solution for interoperable educational data architectures, while reconciling LA with privacy. The protocol can also be used in fields of application beyond LA systems.

5. Acknowledgments

Supported by the Spanish National Research Agency (AEI) through the project CRÊPES (ref. PID2020-115844RB-I00).

6. References

[1] Ali, W., Yao, B., Saleem, M., Hogan, A., Ngomo, A.C.N.: Survey of RDF stores & SPARQL engines for querying knowledge graphs. TechRxiv (2021). https://doi.org/10.36227/techrxiv.14376884.v1
[2] Amo Filva, D., Prinsloo, P., Alier Forment, M., Fonseca Escudero, D., Torres Kompen, R., Canaleta Llampallas, X., Herrero Martín, J.: Local technology to enhance data privacy and security in educational technology. International Journal of Interactive Multimedia and Artificial Intelligence 7(2), 262–273 (2021)
[3] Berg, A.M., Vrolijk, J., Mol, S.T., Fisher, A.: Anonymisation and synthetic data approaches to minimise privacy risks in highly scaled LA intervention driven infrastructure. In: Companion Proceedings of the 11th International Conference on Learning Analytics & Knowledge (LAK21) (2021)
[4] Cesconetto, J., Augusto Silva, L., Bortoluzzi, F., Navarro-Caceres, M., Zeferino, C., R. Q. Leithardt, V.: PRIPRO-privacy profiles: User profiling management for smart environments. Electronics 9(9) (2020). https://doi.org/10.3390/electronics9091519
[5] Conde, M.A., Hernández-García, A.: Data driven education in personal learning environments – what about learning beyond the institution? International Journal of Learning Analytics and Artificial Intelligence for Education 1(1) (2019). https://doi.org/10.3991/ijai.v1i1.11041
[6] Domadiya, N., Rao, U.P.: Privacy preserving distributed association rule mining approach on vertically partitioned healthcare data. Procedia Computer Science 148, 303–312 (2019). https://doi.org/10.1016/j.procs.2019.01.023
[7] Domingo-Ferrer, J., Sánchez, D., Blanco-Justicia, A.: The limits of differential privacy (and its misuse in data release and machine learning). Communications of the ACM 64(7), 33–35 (2021). https://doi.org/10.1145/3433638
[8] Fung, B., Wang, K., Yu, P.: Top-down specialization for information and privacy preservation. In: 21st International Conference on Data Engineering, pp. 205–216 (2005). https://doi.org/10.1109/ICDE.2005.143
[9] Fung, B.C.M., Trojer, T., Hung, P.C.K., Xiong, L., Al-Hussaeni, K., Dssouli, R.: Service-oriented architecture for high-dimensional private data mashup. IEEE Transactions on Services Computing 5(3), 373–386 (2012). https://doi.org/10.1109/TSC.2011.13
[10] Griffiths, D., Drachsler, H., Kickmeier-Rust, M., Steiner, C., Hoel, T., Greller, W.: Is privacy a show-stopper for learning analytics? A review of current issues and their solutions. Learning Analytics Review 6, 1–30 (2016). ISSN 2057-7494
[11] Gursoy, M.E., Inan, A., Nergiz, M.E., Saygin, Y.: Privacy-preserving learning analytics: Challenges and techniques. IEEE Transactions on Learning Technologies 10(1), 68–81 (2017). https://doi.org/10.1109/TLT.2016.2607747
[12] Joksimovic, S., Marshall, R., Rakotoarivelo, T., Ladjal, D., Zhan, C., Pardo, A.: Privacy-Driven Learning Analytics, pp. 1–22. Springer International Publishing, Cham (2022). https://doi.org/10.1007/978-3-030-86316-6_1
[13] Jones, K.M.: Learning analytics and higher education: a proposed model for establishing informed consent mechanisms to promote student privacy and autonomy. International Journal of Educational Technology in Higher Education 16(1), 1–22 (2019)
[14] Khalil, M., Ebner, M.: De-identification in learning analytics. Journal of Learning Analytics 3(1), 129–138 (2016). https://doi.org/10.18608/jla.2016.31.8
[15] Kim, S., Chung, Y.: An anonymization protocol for continuous and dynamic privacy-preserving data collection. Future Generation Computer Systems 93, 1065–1073 (2019). https://doi.org/10.1016/j.future.2017.09.009
[16] Ko, C.C., Young, S.S.C.: Explore the next generation of cloud-based e-learning environment. In: Chang, M., Hwang, W.Y., Chen, M.P., Müller, W. (eds.) International Conference on Technologies for E-Learning and Digital Entertainment. Lecture Notes in Computer Science, vol. 6872, pp. 107–114. Springer, Berlin, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23456-9_20
[17] Martínez-Navarro, A., Moreno-Ger, P.: Comparison of clustering algorithms for learning analytics with educational datasets. International Journal of Interactive Multimedia and Artificial Intelligence 5(2), 9–16 (2018)
[18] Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. PODS '04, Association for Computing Machinery, New York, NY, USA (2004). https://doi.org/10.1145/1055558.1055591
[19] Mohammed, N., Fung, B.C.M., Wang, K., Hung, P.C.K.: Privacy-preserving data mashup. In: Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, pp. 228–239. EDBT '09, Association for Computing Machinery, New York, NY, USA (2009)
[20] Rodríguez-García, M., Balderas, A., Dodero, J.M.: Privacy preservation and analytical utility of e-learning data mashups in the web of data. Applied Sciences 11(18) (2021). https://doi.org/10.3390/app11188506
[21] Rodríguez-García, M., Cifredo-Cachon, M.A., Quiros-Olozabal, A.: Cooperative privacy-preserving data collection protocol based on delocalized-record chains. IEEE Access 8, 180738–180749 (2020)
[22] Sakr, S., Bonifati, A., Voigt, H., Iosup, A., Ammar, K., Angles, R., Aref, W., Arenas, M., Besta, M., Boncz, P.A., Daudjee, K., Valle, E.D., Dumbrava, S., Hartig, O., Haslhofer, B., Hegeman, T., Hidders, J., Hose, K., Iamnitchi, A., Kalavri, V., Kapp, H., Martens, W., Özsu, M.T., Peukert, E., Plantikow, S., Ragab, M., Ripeanu, M.R., Salihoglu, S., Schulz, C., Selmer, P., Sequeda, J.F., Shinavier, J.: The future is big graphs: A community view on graph processing systems. Communications of the ACM 64(9), 62–71 (2021). https://doi.org/10.1145/3434642
[23] Sheth, A.: Internet of things to smart IoT through semantic, cognitive, and perceptual computing. IEEE Intelligent Systems 31(2), 108–112 (2016). https://doi.org/10.1109/MIS.2016.34
[24] U.S. Department of Education: Family Educational Rights and Privacy Act, 34 CFR 99 (FERPA). Online at https://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html
[25] Vidal, V.M.P., Casanova, M.A., Cardoso, D.S.: Incremental maintenance of RDF views of relational data. In: Meersman, R., Panetto, H., Dillon, T., Eder, J., Bellahsene, Z., Ritter, N., De Leenheer, P., Dou, D. (eds.) On the Move to Meaningful Internet Systems Conference. Lecture Notes in Computer Science, vol. 8185, pp. 572–587. Springer, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41030-7_42
[26] Waheed, H., Hassan, S.U., Aljohani, N.R., Hardman, J., Alelyani, S., Nawaz, R.: Predicting academic performance of students from VLE big data using deep learning models. Computers in Human Behavior 104, 106189 (2020). https://doi.org/10.1016/j.chb.2019.106189
[27] Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3(1), 1–9 (2016)
[28] Wolff, A., Moore, J., Zdrahal, Z., Hlosta, M., Kuzilek, J.: Data literacy for learning analytics. In: Proceedings of the Sixth International Conference on Learning Analytics & Knowledge, pp. 500–501 (2016)