A Scientific Research Recommendation System Based on Privacy-Preserving Training Dataset

Shaohua Liu1,*, Lu Lv2, Xiaoguang Su1, Gang Shen3
1 Department of Management Engineering and Equipment Economics, Naval University of Engineering, Wuhan, China
2 College of Life Sciences, South-Central Minzu University, Wuhan, China
3 School of Computers, Hubei University of Technology, Wuhan, China

Abstract
A scientific research recommendation system can provide valuable references for researchers when choosing topics and determining research directions. However, traditional scientific research recommendation obtains its model by training on researchers' behaviour datasets stored at a centre, which may lead to the disclosure of researchers' sensitive information. In this paper, we propose a scientific research recommendation system based on a privacy-preserving training dataset. Specifically, we use a federated learning mechanism and threshold homomorphic encryption to make the recommendation model available without uploading the raw dataset, which protects the privacy of the researchers' behaviour data. Additionally, we use a method to process the datasets of low-quality researchers to improve the accuracy of the recommendation model. Our analysis shows that not only is the researchers' privacy protected, but the accuracy of the recommendation model is also optimized. The experimental results show that the proposed scheme satisfies the functional requirements of a scientific research recommendation system.

Keywords
Scientific research recommendation, Privacy-preserving, Federated learning

1. Introduction
With the massive growth of scientific research data, scientific research recommendation systems are becoming an indispensable assistant for researchers choosing their own research interests [1], [2].
A scientific research recommendation system can recommend actively by analysing the interactive behaviour of researchers, providing them with more accurate research directions and hot topics according to their research interests. An excellent scientific research recommendation system therefore needs to be trained on a high-quality dataset. In general, the training dataset of the recommendation model comes from the behaviour data of the many researchers who access the system. These behaviour data often reflect the researchers' research interests and identity information, and if this private information is leaked, it may have a negative impact on the researchers' lives. The traditional recommendation system centralizes the dataset on a central server for training. Such centralized storage of researchers' behaviour data carries the risk of disclosing private information, because the data can easily be obtained by malicious third parties or untrusted cloud servers. To overcome these problems, many scholars have proposed training the model with a federated learning mechanism [3-5]. In this mechanism, all data owners train on their data locally and upload only the trained gradients to the central server, so the raw data is never disclosed. However, there are still obstacles to solving the problem with federated learning. On the one hand, an adversary can obtain some of the researchers' sensitive information from the uploaded gradients. On the other hand, there are unreliable researchers who hold low-quality datasets [5], [6].

ICCEIC2022@3rd International Conference on Computer Engineering and Intelligent Control
EMAIL: *Corresponding author: lshhmail@163.com (Shaohua Liu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
Since there may be a large numerical difference between a gradient trained on a low-quality dataset and the ideal gradient, low-quality datasets affect the accuracy of the recommendation model. To address this, we propose a scientific research recommendation system based on a privacy-preserving training dataset. First of all, we test each local gradient and rule out the unreliable ones, while ensuring that a sufficient number of datasets are still used to train the recommendation model. The contributions of this article can be summarized as follows:
⚫ First, we propose a scientific research recommendation system based on a privacy-preserving training dataset. The scheme uses a federated learning mechanism and threshold homomorphic encryption to protect the researchers' privacy.
⚫ Second, the proposed scheme can mitigate the negative impact of low-quality data caused by unreliable researchers.
⚫ Finally, we conduct extensive experiments to verify that the proposed scheme performs well in terms of security and efficiency.
The rest of this article is organized as follows. In Section 2, we describe the relevant primitives that the scheme uses. We introduce the system model and the specific scheme in Section 3 and Section 4, respectively. We present the security and performance analysis in Section 5. Finally, we summarize the proposed scheme.

2. Preliminaries

2.1. (t, n) threshold Paillier cryptosystem
In the proposed scheme, we use the (t, n) threshold Paillier cryptosystem [7] to encrypt sensitive information. The advantage of the threshold Paillier cryptosystem is that it is not only additively homomorphic but also has a threshold property: only a coalition holding at least a certain number t of key shares can decrypt. The cryptosystem includes the following algorithms:
⚫ Key generation: Choose two large primes p, q, calculate n = pq, and select a generator g ∈ Z*_(n^(s+1)).
Then, the public key is pk = (g, n^s), and the private key shares are s_i = f(i), 1 ≤ i ≤ n.
⚫ Encryption: Given a plaintext m, use a random r ∈ Z*_(n^(s+1)) to calculate the ciphertext c = g^m · r^(n^s) mod n^(s+1).
⚫ Share decryption: Each private key share holder calculates its own share c_i = c^(2Δs_i) mod n^(s+1), where Δ = n!.
⚫ Share combining: By the Lagrange interpolation algorithm [7], the plaintext of c can be recovered by combining t shares c_i.
The homomorphic property of the above algorithm is as follows:

c = E_pk(m_i + m_j) = g^(m_i + m_j) · (r_i · r_j)^(n^s) mod n^(s+1) = E_pk(m_i) · E_pk(m_j)   (1)

2.2. Federated learning
In traditional machine learning, datasets are pooled and trained together, so the raw data may be leaked to adversaries. To combat this privacy issue, Google proposed a framework for federated learning in 2016, which allows distributed users to train locally without exposing their raw data. Federated learning is a technology that uses distributed optimization to protect data privacy in multi-party cooperation [8]. It allows multiple clients to cooperate under the coordination of a central server, so that a complete machine learning model can be obtained even though the data remains scattered among the clients.
Typically, federated learning consists of the following four steps:
1) All clients train on their local data independently;
2) Each client encrypts its trained gradient and uploads it to the central server;
3) The central server aggregates all uploaded gradients securely;
4) The central server sends the global model to each client.

3. System model, threat model and requirements

3.1. System model
As shown in Figure 1, the system model of the proposed scheme includes three entities: a trusted third party (TTP), a central server (CS), and the researchers who provide datasets. Each researcher computes a local gradient by training on his/her behaviour dataset locally, and then uploads the gradient to CS.
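Before CS aggregates these encrypted uploads, it is the additive homomorphism of equation (1) that makes aggregation over ciphertexts possible. The following toy sketch illustrates that property with a plain (non-threshold) Paillier instance; the tiny primes and all parameter values are illustrative only, and the actual scheme uses the threshold Damgård-Jurik variant [7].

```python
import math
import random

# Toy (non-threshold) Paillier with s = 1, only to illustrate the additive
# homomorphism E(m1) * E(m2) = E(m1 + m2). The primes are far too small for
# any real security.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1                      # standard generator choice for s = 1
lam = math.lcm(p - 1, q - 1)   # Carmichael function lambda(n)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # decryption constant

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

# Multiplying ciphertexts modulo n^2 adds the underlying plaintexts.
c = encrypt(20) * encrypt(22) % n2
assert decrypt(c) == 42
```

Multiplying ciphertexts modulo n^2 adds the underlying plaintexts, which is how CS can later combine the researchers' encrypted gradients without decrypting any individual contribution.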
After that, CS aggregates all uploaded gradients to train a global research recommendation model. The global model is then fed back to each local researcher, who trains a new gradient based on it. This iteration does not end until the accuracy of the global model meets certain requirements. The entities in the system model are shown in Figure 1.

Figure 1: System model

We use cosine similarity to measure the correlation between a local gradient and the ideal gradient. The initial ideal gradient is a preset value.

3.2. Threat model and requirements
In the proposed scheme, the threats come from external adversaries and internal adversaries. The central server becomes an internal adversary if it is corrupted, since it can use its position to obtain the researchers' stored behaviour information. Given this threat model, the requirement of the proposed scheme is to protect the privacy of the gradient information provided by researchers; that is, sensitive information must not be disclosed during gradient transmission or storage. In addition, to improve the accuracy of the recommendation system, unreliable participants should be screened out before the gradients are uploaded.

4. The proposed scheme
In this section, we introduce the proposed scheme. First, our scheme considers the problem of unreliable researchers: the local gradient generated in the ith iteration must be compared with the ideal gradient of that round in order to improve the accuracy of the recommendation model. Second, the security of the gradients during transmission and storage is also considered. The scheme includes four phases: system initialization, processing of low-quality dataset researchers, local gradient encryption, and generation of the recommendation model.

4.1. System initialization
TTP is responsible for initializing the system.
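As a rough sketch of what this initialization involves, the (t, n) key distribution can be illustrated with Shamir-style secret sharing over a prime field. This is a simplification for illustration only: the real scheme shares a threshold Paillier key as in Damgård and Jurik [7], and the field prime, secret, and share counts below are arbitrary example values.

```python
import random

# Toy Shamir (t, n) sharing: split a secret into n shares so that any t of
# them recover it, mirroring how private key shares sk_1..sk_I could be
# distributed among researchers.
P = 2**127 - 1  # a Mersenne prime, used as the field modulus

def make_shares(secret, t, n):
    # Random polynomial f(x) = secret + a1*x + ... + a_{t-1}*x^{t-1} mod P;
    # share i is the point (i, f(i)).
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(i, sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]

def recover(shares):
    # Lagrange interpolation of f at x = 0 from t points.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = make_shares(123456789, t=3, n=5)
assert recover(shares[:3]) == 123456789   # any t shares suffice
assert recover(shares[1:4]) == 123456789
```

Any t of the n shares reconstruct the secret by Lagrange interpolation at x = 0, mirroring the share-combining step of Section 2.1, while fewer than t shares reveal nothing about it.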
Given a security parameter λ, TTP generates a public key pk for all entities and distributes a set of private key shares {sk_1, sk_2, ..., sk_i, ..., sk_I}, assigning sk_i to researcher R_i. G* = {G*_0, G*_1, ..., G*_i, ..., G*_(I-1)} is the global ideal gradient, generated by pre-training the scientific research recommendation model. Here, G*_i denotes the ideal gradient of the ith iteration.

4.2. Processing of low-quality dataset researchers
Researchers with low-quality datasets should be screened out before local gradients are uploaded; otherwise, they will degrade the accuracy of the scientific research recommendation system. Suppose G^j = {G^j_1, G^j_2, ..., G^j_i, ..., G^j_I} is the jth researcher's local gradient, where i indexes the iterations. Given the ideal gradient G*_(i-1), each participating researcher compares its own ith-round gradient with it, as follows:

sim(G*_(i-1), G^j_i) = (G*_(i-1) · G^j_i) / (||G*_(i-1)|| · ||G^j_i||)   (2)

where sim(·) is the cosine similarity. According to equation (2), the higher the value of sim(·), the more reliable the local gradient. When sim(·) is less than a certain threshold, the researcher holding that gradient is an unreliable participant. In the proposed scheme, it is assumed that I participating researchers are needed to train the global recommendation model. Unreliable researchers are screened out and new participants are reselected, ensuring that enough researchers remain to train the global model. The specific process is shown in Algorithm 1.

4.3. Local gradient encryption
Each reliable participant j encrypts his/her gradient as c_j = Enc_pk(G^j) with the public key pk and then uploads it to the central server. After receiving all encrypted local gradients, the central server aggregates them as follows:

c = c_1 · c_2 · ... · c_n = Enc_pk(G^1) · Enc_pk(G^2) · ... · Enc_pk(G^n) = Enc_pk(G^1 + G^2 + ... + G^n)

Algorithm 1: Processing of low-quality dataset researchers
Input: Global ideal gradient G* = {G*_0, G*_1, ..., G*_i, ..., G*_(I-1)}, local gradient G^j = {G^j_1, G^j_2, ..., G^j_i, ..., G^j_I}, threshold TH
Output: Reliable gradients
1: Initialize the global ideal gradient G*;
2: In the jth iteration, given G*;
3: for i = 1 to I do
4:   if Eqn. (2) ≥ TH then
5:     return G^j_i;
6:   end if
7: end for

4.4. Generation of recommendation model
The recommendation model is derived from the aggregate of all uploaded gradients. Therefore, the central server needs t participants, each using its private key share sk_j to calculate the secret share c_j = c^(2Δsk_j) mod n^(s+1). Then, using t such shares c_j, the plaintext of the aggregated gradients can be restored by the Lagrange interpolation algorithm [7].

5. Security and performance analysis
In this section, we discuss the security and performance of the proposed scheme. The experiments are based on the MNIST database and carried out on a machine with an Intel(R) Core(TM) i7-9750H and 8 GB RAM.

5.1. Security analysis
The security of the proposed scheme focuses on protecting the privacy of the researchers who provide the training dataset; specifically, each participating researcher's gradient is protected.
Proof: In the proposed scheme, the researchers involved in the training do not send their raw data to the central server; training is done locally. Therefore, an adversary cannot obtain the researchers' raw information. In addition, each trained gradient is encrypted by the threshold Paillier cryptosystem as c_j = Enc_pk(G^j). From Section 4, the decryption key must be recovered by at least t participating researchers, so it is very difficult for an adversary or the central server to obtain it. Therefore, our scheme protects the researchers' datasets.

5.2.
Performance analysis
Here, we mainly discuss the computation cost of encryption and decryption, and the efficiency of low-quality user verification in the proposed scheme. In the local gradient encryption phase, each researcher encrypts his/her gradient as c_j = Enc_pk(G^j), and the global gradient is recovered in the model generation phase.

Figure 2: Computation cost of encryption and decryption

Figure 2 shows how the computation cost of gradient encryption and decryption grows with the number of iterations. Next, we discuss the accuracy of the proposed scheme once the processing of low-quality dataset researchers is taken into account. To describe the experiments, we compare the proposed scheme, in terms of accuracy, with a normal federated learning mechanism (NFM) scheme and a filtered-but-not-reselected federated learning mechanism (FBNRF) scheme. NFM refers to a federated learning scheme that does not deal with low-quality dataset researchers, and FBNRF denotes a scheme in which low-quality dataset researchers are screened out but not reselected.

Figure 3: Accuracy

As shown in Figure 3, since the ideal global gradient has not yet been obtained, the accuracy of all three schemes in the initial iteration is about 10%. After three iterations, however, the accuracy of FBNRF and our scheme is better than that of NFM. In the 9th iteration, the accuracy of the three schemes is 72%, 84%, and 95%, respectively, and after all iterations are completed, the accuracy of our scheme remains higher than that of the other two schemes.

6. Conclusion
In this article, we propose a scientific research recommendation system based on a privacy-preserving training dataset. We test the local gradients and rule out the unreliable ones, and we use a federated learning mechanism and threshold homomorphic encryption to make the scientific research recommendation model available without uploading the raw dataset.
Security and performance analysis shows that the proposed scheme meets the security and efficiency requirements of a scientific research recommendation system's dataset. In the future, we will study the privacy protection of query users and model parameters in scientific research recommendation.

7. References
[1] Nishioka C., Hauke J., and Scherp A.: Influence of tweets and diversification on serendipitous research paper recommender systems, PeerJ Comput. Sci., vol. 6, p. e273, 2020.
[2] Zhou X., Liang W., Wang I. K., and Yang L. T.: Deep mining based on hierarchical hybrid networks for heterogeneous big data recommendations, IEEE Trans. Comput. Soc. Syst., vol. 8, no. 1, pp. 171-178, 2021.
[3] Duan M., Liu D., Chen X., Liu R., Tan Y., and Liang L.: Self-balancing federated learning with global imbalanced data in mobile systems, IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 1, pp. 59-71, 2021.
[4] Fang C., Guo Y., Hu Y., Ma B., Feng L., and Yin A.: Privacy-preserving and communication-efficient federated learning in internet of things, Comput. Secur., vol. 103, p. 102199, 2021.
[5] Li Y., Li H., Xu G., Huang X., and Lu R.: Efficient privacy-preserving federated learning with unreliable users, IEEE Internet of Things J., vol. 9, no. 13, pp. 11590-11603, 2022.
[6] Hsieh K. et al.: Gaia: Geo-distributed machine learning approaching LAN speeds, in Proc. 14th USENIX Symp. Netw. Syst. Design Implement. (NSDI), 2017, pp. 629-647.
[7] Damgård I., and Jurik M.: A generalization, a simplification and some applications of Paillier's probabilistic public-key system, in Proc. Int. Workshop Pract. Theory Public Key Cryptogr., 2001, pp. 119-136.
[8] Lu Y., Huang X., Zhang K., Maharjan S., and Zhang Y.: Blockchain empowered asynchronous federated learning for secure data sharing in internet of vehicles, IEEE Trans. Veh. Technol., vol. 69, no. 4, pp. 4298-4311, 2020.