Research on protein complex recognition method based on genetic algorithm Zhiqiang Liu1, ∗ and Yulan Tang1 1 Chongqing Sino Konjac Biotechnology Co., LTD, Chongqing, China Abstract In the rapid development of social economy and science and technology, how to study the interaction between proteins to complete the law of life activities has been one of the hot topics in academic research. Since protein interaction network is a typical complex network, which directly presents a significant social structure, and the corresponding functional modules are regarded as protein complexes, the corresponding identification and research methods have a positive role in predicting protein functions and explaining specific biological processes. Therefore, after understanding the research status of protein interaction network at home and abroad, this paper clarifies the protein complex recognition algorithm based on genetic algorithm according to the basic concept of protein interaction network, and then verifies the application value of the algorithm through experimental research, so as to provide reference for the application of the algorithm in the new era. Key words genetic algorithm; Protein complex; Interaction; Select strategy; Crossover strategy 1. Introducion Under the background of the comprehensive promotion of the human Genome Project, life science research has entered the post-genome era, and proteomics research is one of the key contents. As early as the early 1990s, some scholars proposed the basic concept of proteome, which refers to the collection of all proteins produced by a gene or variable RNA splicing and post-transcriptional modification of mRNA expressed by a cell or tissue. In other words, a collection of proteins expressed under certain conditions, at a certain moment and in a certain type of cell or organism. In the process of research and exploration, protein organization includes basic function, post-translational modification, protein identification and many other contents. Proteomics theory and related technologies can be used to accelerate the development of molecular medicine, truly serve ICCIC 2024: International Conference on Computer and Intelligent Control, June 29–30, 2024, Kuala Lumpur, Malaysia ∗ Corresponding author. liuzhiqiang@xidakonjac.onaliyun.com (Z. Liu); tangyulan@xidakonjac.onaliyun.com (Y. Tang) 0009 0000 8990509X (Z. Liu); 0009 0009 8290 7573 (Y. Tang) © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings human life and health, and provide reference for the design of drugs that act on specific target molecules. Protein is a biological macromolecule that occupies a special position in the biological body. It is composed of a variety of amino acids according to dehydration condensation according to a specific method. After winding and folding of polypeptide chains, polymer compounds with a certain spatial structure can be formed, which is the basic condition of all metabolic activities. Because proteins in cells cannot exist independently and must interact with other proteins to achieve biological functions, protein interactions regulate most biological processes at the cellular level, such as DNA replication and transfer, signaling, and cell cycle regulation. In order to easily and quickly discover complexes from protein interaction networks, some scholars proposed to use graph representation, protein interaction network, protein belongs to the vertex, the interaction between proteins belongs to the edge, the use of a variety of graph clustering algorithms, intelligent optimization algorithms and complex network theory for mining analysis, not only can better understand the topological structure of protein networks, It can also clarify the biological significance of the complex, which will eventually help predict unknown protein functions and human disease genes in the new era [1]. In the 21st century, some scholars first proposed topological structures found in networks. Complex recognition has attracted extensive attention in the fields of data mining, bioinformatics and complex networks. They are connected together in a relatively close way to accomplish specific functions together. It has been found that in the protein network, the complex has strong biological unity, and the functional similarity of the proteins contained in the complex exceeds the functional similarity of other protein pairs in the network. Some scholars have also proved that the density inside the complex is relatively high, which is related to the biological unity, so it is of great significance to study the complex recognition algorithm in the biological network. Current scholars have proposed a cluster-based protein complex discovery method, which uses points or clusters as seed nodes to expand, regards density as distance targets, and identifies the cluster as a protein complex when the density value of the cluster is higher than a certain threshold. Some scholars have proposed an intelligent optimization algorithm represented by heuristics, which has gradually formed a competitive complex discovery method in practice. Compared with clustering algorithm, it is mainly used to solve the problem of social structure discovery in complex networks such as social networks. For example, in 2007, some scholars used genetic algorithm for community mining for the first time, and proposed a single-path crossover operation suitable for string coding. In order to solve the insufficient application of faction filtering algorithm (DPM), some scholars introduced the protein complex scale of distance constraint recognition, and proposed the protein complex recognition algorithm DPM-DR based on group penetration and distance constraint, which is more accurate and effective in practice and more comprehensive in content [2.3]. This is also the beginning of the application of multi-objective evolutionary programming ideas in protein interaction networks. According to the existing research results, the most common compound discovery algorithm of protein interaction network is still clustering algorithm, but there are still many challenges in practical application research. Therefore, this project mainly discusses the complex discovery method in protein interaction network from the perspective of food science and technology research in China, taking the protein interaction network of Saccharomyces cerevisiae as an example [4]. 1.1 Research status of protein complex The recognition of protein complex is a hot topic in bioinformatics. The accurate recognition of protein complex can help human understand the process of living organism. With the advancement of traditional biological experiments, researchers have obtained a large amount of PPI data, which provides a basis for the identification of unknown protein complexes. Based on this, many methods for protein complex identification based on PPI data have emerged. In this paper, some representative methods for predicting protein complexes are selected, and these domestic and foreign research methods are summarized in this section. At present, there are four traditional methods to identify protein complexes in biological experiments: mass spectrometry, protein microarray, X-ray crystallography and nuclear magnetic resonance. Although the traditional prediction method has certain advantages, such as high accuracy, can identify the three-dimensional structure of the protein complex, can observe the atomic level of detail, but the high cost of the experiment, long cycle and other disadvantages pose challenges to the implementation of the experiment. Protein microarray technology requires expensive reactants and reagents; The method of identifying protein complex based on X-ray crystallography requires high quality crystal as the experimental medium, and there may be some radiation damage during the experiment. NMR, on the other hand, may be disturbed by low-quality signals and may produce spectral overlap. To avoid these problems, the identification of protein complex [5] by PPIN became a "parallel" to traditional prediction methods. Compared with the above six methods, other protein complex recognition algorithms are less used. The typical ones are RSGNM method, which constructs PPIN through Poisson distribution and defines a series of parameters, and determines whether a protein belongs to protein complex by adjusting the parameters. The SGNMF method combines the Laplace operator of PPIN and the Kullback-Lcibler divergence to find protein complexes . In addition, influenced by heuristic algorithms, some swarm intelligence optimization algorithms are also used to identify protein complexes. F-MCL method combines Markov method and Firefly FA method to mine protein complex. ICSC is an improved cuckoo search clustering algorithm similar to F-MCL method, which is used to detect protein complex in weighted dynamic PPIN. In ICSC, a dynamic PPIN needs to be constructed first. Then the complex core structure of the subnetwork is found, and the accessory structure of the protein complex is obtained by clustering, so as to identify the protein complex. Unlike ICSC, MNC uses a multi-network model to mine protein complexes, the authors argue that traditional graph clustering algorithms only identify protein complexes by a single network, ignoring information from many other networks, while multi-network clustering models mine protein complexes by relationships between domain interactions. At present, the mainstream key protein recognition algorithms can be divided into the following five categories: 1. Methods based on biological experiments The most direct methods for identifying key proteins in biology-based experiments are conditional gene knockout, single gene knockout and RNA interference. Among them, gene knockout is a molecular biological technique that emerged after the 1980s. Based on the basic principle of DNA homologous recombination, gene knockout is realized by replacing the target gene fragment with the homologous fragment of gene. The principle of RNA interference to identify key proteins is that RNA induces rapid and specific degradation of homologous mRNA during evolution. Although the traditional method of identifying key proteins has high accuracy, it requires a lot of human, material and biological resources. With the development of high-throughput technology and the increasing amount of biological data, how to efficiently mine key proteins in massive PPins has become a research hotspot [6]. 2. Method based on topology structure Studies have shown that whether a protein is a key protein is related to the topological characteristics of protein nodes in PPIN. The method of identifying key proteins based on topology first needs to calculate the topological centrality of the protein nodes in the network. The larger the topology centrality value of nodes in the network, the higher the probability of node becoming a key protein. The "center-lethal" rule proposed by Jeong et al. shows that in PPIN, the greater the degree value of a node, the more important it is. A few years later, researchers used degree centrality (DC) to identify key proteins, with promising results. Since then, there have been many methods to identify key proteins by using topological centrality. Among them, the intermediate centrality (BC) is obtained by using the number of shortest paths passed by the node when calculating the topological centrality of a node. The information centrality (IC) calculates the length of the harmonic mean path to obtain the topological centrality value of the protein nodes in the network. Proximity centrality (CC) is to obtain the degree of intimacy between protein nodes and calculate the topological centrality of nodes. Subgraph centrality (SC) The number of simple paths taken by a node is calculated to determine topological centrality. Wang et al. proposed domain centrality (NC) based on edge aggregation coefficient to identify key proteins in 2011; In addition, eigenvector centrality (EC) and local mean junction centrality (LAC) are also used to identify key proteins. It is worth mentioning that Cytoscape is a software for biological network analysis and node centrality evaluation, making it easier to identify key proteins based on topological centrality. Although the identification of key proteins based on topological centrality is very accurate, the identification of key proteins will be greatly affected due to the large number of false positive data in the protein interaction data used [7]. 2.Method 2.1 Basic Concepts The protein interaction network can be viewed as a simple undirected graph, where node binding represents a collection of proteins and edits represent interaction binding between proteins. In this paper, the overlap score evaluation is used to measure the similarity between subgraphs. If subgraphs have more common nodes, the corresponding overlap score will be higher and higher. As one of the more common intelligent algorithms, genetic algorithm is a concept proposed by referring to biological genetic mechanism to find an optimal solution. The actual process is shown in Figure 1 below: [8.9] Figure 1: Flow chart of genetic algorithm. 2.2 Protein complex recognition algorithm based on genetic algorithm The complex identification analysis is carried out in the protein interaction network, that is, the graph clustering of the corresponding network, and finally a cluster corresponding to a protein complex is obtained. In essence, graph clustering refers to a NP-hard combinatorial optimization method, which uses heuristic search method to obtain satisfactory results within a specified time, and the selection of seed nodes and cluster expansion process is the key. Genetic algorithm, as one of the more common algorithms in current application research, will use population search to select, cross, mutation and other basic operations on the current population, so as to generate a new generation of population, and use population evolution to reach or approach the optimal solution. In this paper, the graph clustering method GAGC with genetic algorithm as the core is used for complex discovery. The specific steps are as follows: [10.11] First, objective function - individual fitness. As an important index of genetic algorithm, fitness is mainly used to evaluate the degree of superiority and inferiority of individuals in the population. The actual function value will affect the survival probability of each individual in the population, and will affect the performance and convergence speed of the algorithm. The algorithm studied in this paper takes F-measure as an individual fitness evaluation function, which includes two indicators: accuracy rate and recall rate. The former refers to the ratio between the compound number correctly identified by the algorithm and the compound number obtained by the algorithm, while the latter refers to the accuracy rate of identifying protein complexes in the standard database. Second, the initial population generation process. In the research process, factor model algorithm (IPCA) is used to construct a set of feasible graph structure, which is regarded as the initial population. The node weights of the network are defined first, and the node with the greatest weight is selected as the seed node and expanded into clusters. Only the node with the greatest weight is selected as the seed node for cluster expansion, which cannot guarantee the diversity of the initial population. Therefore, the IPCA algorithm should be improved. In the process of selecting seed nodes, the roulette mechanism is added, and the greater the weight of nodes, the higher the probability of seed nodes. In this way, not only can the selection of seed nodes be guaranteed to have a greater weight, but also can reflect the diversity of choices, and finally obtain multiple feasible solutions as the initial population [12]. Third, choose a strategy. In order to demonstrate the survival of the fittest concept of genetic algorithm, high-quality individuals should be allowed to participate in evolution, and individuals in the population should be selected to produce the next generation of individuals. At this time, individuals with higher fitness should be selected for optimization. If the fitness value is selected directly according to the order from high to low, the diversity of the population will be reduced, and the genetic algorithm will be premature. Therefore, in the algorithm research, roulette is used to select the probability of becoming the next generation of individuals in direct proportion to the fitness of nodes. At the same time, in order to ensure that there are optimal individuals in each generation of population, if the fitness value of the optimal individual is lower than that of the previous generation of optimal individuals after a new population for next generation optimization is generated, So you replace the worst of the current population with the best of the previous generation. Fourth, the crossover strategy. This step is the key content of genetic algorithm. If two chromosomes are directly selected to cross, there will be a large number of nodes without cluster belonging, which directly affects the clustering effect. In order to cross effectively, the genes in the two chromosomes are aligned according to the overlap score, so that the subgraph represented by the two alleles has more duplicate nodes. In this process, based on allelic cross operation, the connectivity of new subgraphs generated after gene cross can be ensured [13]. Fifth, the strategy of variation. In addition to inheriting the information of the parent's individual, the offspring produced by cross operation will also vary according to a certain probability, which fully reflects the diversity of biological heredity. In the process of mutation, a random mutation is first generated, corresponding to the gene on the chromosome [14]. 3. Result analysis 3.1 Experimental design In order to verify the application performance of the proposed algorithm, from the perspective of food science and technology in China, the yeast protein interaction network data was taken as the research object, and after obtaining the data set meeting the research requirements, the experimental analysis was carried out according to the process shown in Figure 2 below: Figure 2: Experimental flow chart Table 1 Data set and attribute analysis Protein Protein Network interaction Average Key protein Cluster network Number of Active edge number degree Qualitative coefficient vertices quantity MIPS 4 546 12 319 5.42 1 016 0.0879 DIP 5 093 24 743 9.72 1 167 0.0973 In order to further explore the accuracy and effectiveness of the applied algorithm, the ranking results of corresponding protein nodes should be obtained after the node weights assigned by the descending order algorithm, and protein nodes of different sizes should be selected as alternative key proteins for comparison and analysis with standard key protein data. It is possible to specify the number of candidate key proteins that are truly recognized as key proteins. 3.2 Experimental Results In the parameter setting and selection of the algorithm, experiments are conducted according to the above selected network set, the obtained node weights are processed according to the sorting-screening method, and the first 100, 200, 300, 400, 500 and 600 nodes in the sequence are selected as alternative key proteins. Thus, the number of true key proteins contained in the candidate key proteins can be determined, and finally the results shown in Table 2 below can be obtained: Table 2 Parameter selection and comparison results a MIPS network DIP network 100 200 300 400 500 600 100 200 300 400 500 600 0 48 82 123 163 213 255 46 82 115 158 201 252 0.1 52 97 130 178 222 267 69 103 144 186 236 278 0.2 61 105 147 186 230 275 77 132 181 217 260 301 0.3 66 118 163 199 239 278 83 145 193 241 283 324 0.4 65 126 172 209 249 288 82 145 199 253 300 349 0.5 67 127 180 226 265 298 81 149 203 257 305 349 0.6 67 127 186 234 279 312 79 147 207 259 302 347 0.7 68 131 187 238 281 320 75 147 208 260 304 352 0.8 68 134 186 234 284 327 79 149 208 256 308 351 0.9 67 133 184 229 279 331 75 149 209 262 305 349 After defining the algorithm flow, compared with the application performance of other algorithms, including DC, BC, EC, SC, LAC, NC, and UC, the final results show that the sensitivity, specificity, positive and negative prediction values of the proposed algorithm are significantly better than those of other algorithms, as shown in Table 3 below. In the future research and development of food science and technology, scholars should continue to combine other biological information, machine learning and deep learning algorithms to comprehensively improve the recognition efficiency and quality of key proteins, in order to provide a reference for the application of technical algorithms. Table 3 Performance comparison results of different algorithms Numeric set method SN SP PPV NPV F ACC DC 0.254 0.803 0.291 0.772 0.271 0.671 BC 0.197 0.796 0.278 0.716 0.231 0.629 EC 0.139 0.773 0.163 0.738 0.150 0.620 SC 0.138 0.773 0.162 0.739 0.149 0.620 MIPS LAC 0.271 0.812 0.314 0.779 0.291 0.682 NC 0.281 0.814 0.325 0.781 0.302 0.686 UC 0.271 0.812 0.314 0.778 0.291 0.682 EIC 0.426 0.853 0.480 0.824 0.452 0.750 DC 0.353 0.834 0.409 0.800 0.379 0.716 BC 0.308 0.823 0.361 0.785 0.333 0.700 EC 0.323 0.824 0.374 0.789 0.347 0.701 SC 0.316 0.822 0.366 0.787 0.339 0.698 DIP LAC 0.405 0.852 0.472 0.815 0.436 0.743 NC 0.400 0.850 0.463 0.813 0.428 0.739 UC 0.391 0.850 0.458 0.811 0.422 0.737 EIC 0.440 0.862 0.509 0.826 0.472 0.759 In this paper, protein networks are divided into 11 groups according to subcellular localization information. According to the above conclusions, we conclude that the distribution of key proteins in the organism is not uniform, and the number of key proteins is more in some regions. We expect to improve the recognition of complexes and, more importantly, key proteins by clustering protein networks by subcellular location. In order to prove the effectiveness of Ess-LGS method, this section selects a variety of methods to compare ACC, Precision, Recall and F1 values, and compares the performance of different algorithms on the same data set. Table 4 shows the clustering effect of protein complexes in 11 subcellular locations in this paper. Table 4 Experimental data for protein complexes at 11 subcellular locations Precision Recall F1 value ACC Endoplasmic 0.716 0.490 0.582 0.340 Cytoskeleton 0.590 0.770 0.670 0.330 Cytosol 0.532 0.648 0.584 0.219 Endosome 0.720 0.450 0.550 0.320 Extracellular 0.844 - - - Golgi 0.868 0.564 0.676 0.342 Mitochondrion 0.701 0.644 0.740 0.317 Nucleus 0.636 0.590 0.640 0.374 Peroxisome 0.844 0.323 0.428 0.404 Plasma - 0.733 0.784 0.294 Vacuole 0.794 0.500 0.608 0.316 4. Conclusion To sum up, in the face of higher requirements for food safety and biomedicine in the new era, with the continuous improvement of relevant technical theories, the development of industry technology is facing new opportunities and challenges. Especially for the identification of key proteins, although scholars from various countries have proposed and applied a large number of methods to predict key proteins, how to improve the accuracy of recognition has been the main problem puzzling scholars. In this paper, a protein complex recognition method based on genetic algorithm is proposed, and its performance is obviously better than other algorithms, so it is worth popularizing and applying in clinical field. Reference [1] P. Lu, J. Wei, “Key protein recognition algorithm based on complex structure and network topology,” Journal of Lanzhou University of Technology, vol. 002, pp. 048, 2022. [2] B. Zhong, J. Yang, Q. Yin, “Alloy process Optimization based on Neural Network and Genetic Algorithm,” Equipment Manufacturing Technology, vol. 7, pp. 272-273, 2023. (in Chinese) [3] Y. Tang, D. Li, Q. He, “Expression recognition method based on genetic algorithm and integrated pruning,” Journal of Sichuan Light Chemical University: Natural Science Edition, vol. 36, no. 5, pp. 67-75, 2023. (in Chinese) [4] J. Liang, Y. Zhang, D. Chun, “Frozen soil constitutive model based on GA and PSO algorithm parameter identification,” Journal of ice frozen soil, vol. 46 no. 1, pp. 235-246, 2024. DOI: 10.7522 / j.i SSN. 1000-0240.2024.0020. [5] R. Feng, Z. Chen, S. Yi, “Identification of maize varieties by SVM based on Bayesian optimization,” Spectroscopy and Spectral Analysis, vol. 006, pp. 042, 2022. [6] R. Li, Y. Chen, H. Wu, et al., “Analysis of damage and failure of reinforced concrete shed cave laid with sand-EPE composite cushion under rockfall impact,” Engineering Mechanics, vol. 41, pp. 1-16, 2024. [7] J. Zhang, C. Zhong, “Research progress of protein complex and functional module prediction algorithms based on protein interaction networks,” Guangxi Science, vol. 002, pp. 029, 2022. [8] Y. Ge, X. Zhao, “Two-step protein complex prediction method based on graph embedding and cuckoo search,” Journal of Ocean University of China (Natural Science Edition), vol. 007, pp. 052, 2022. [9] Z. He, H. Cao, F. Zhao, “Simulation of 10 kV Arrester deterioration state recognition based on Hybrid Genetic Algorithm,” Microcomputer Applications, vol. 11, pp. 69-72, 2023. [10] W. Chen, “Design of automatic facial expression image recognition system based on improved genetic algorithm,” Automation Technology and Application, vol. 42, no. 5, pp. 35-39, 2023. (in Chinese) [11] S. Lin, “Identification of sizing slice spun material based on Genetic algorithm and BP neural network,” Light Industry Science and Technology, vol. 6, pp. 125-129, 2023. (in Chinese) [12] Q. Zhang, “Handwritten digit recognition based on BP neural network optimization based on genetic algorithm,” Information and Computer, vol. 35, no. 7, pp. 75-77, 2023. (in Chinese) [13] H. Huang, T. Wu, W. Wang, et al., “Protein complex structure prediction: methods and advances,” Chinese Journal of Synthetic Biology, vol. 4, no. 3, pp. 507-523, 2023. (in Chinese) [14] L. Liang, W. Cao, L. Li, et al., “Research progress of Protein-polysaccharide non-covalent and covalent complex embedding active ingredients,” Food Science, vol. 44, no. 21, pp. 368-385, 2023.