Analysis of the Evolution and Causes of Educational Involution Based on Prisoner's Dilemma and Reinforcement Learning

Yicheng Gong, Yanli Xu*, Qing Liu, Yuqiang Feng
Faculty of Science, Wuhan University of Science and Technology, Wuhan, Hubei, China

AHPCAI2022@2nd International Conference on Algorithms, High Performance Computing and Artificial Intelligence
EMAIL: gongyicheng@wust.edu.cn (Yicheng Gong); xuyanli@wust.edu.cn (Yanli Xu, *corresponding author); liuqing@wust.edu.cn (Qing Liu); yqfeng6@126.com (Yuqiang Feng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

Abstract

The difficult situation of adolescent growth is often attributed to vicious competition that leads to educational involution. In this paper, an increased frequency of "focus on scores" combined with a significantly reduced return is taken as the sign of educational involution, and an educational game is constructed to analyze how "home-school students" choose and balance between "focus on scores" and "focus on happiness". Because experiments on real education are difficult to conduct, a Q-learning simulation of 10,000 rounds was used to reveal how the educational game evolves in reality. The results show that in the early stage, the frequency of "focus on scores" increased slowly and stayed below 50%, the return did not decrease significantly, and involution did not form; in the middle stage, the frequency of "focus on scores" increased rapidly, the return of the player with the higher frequency decreased significantly, and involution formed; in the later stage, the return of the player who entered involution first dropped to the bottom, after which that player was overtaken in both frequency and return, and both quantities rose slowly; finally, the frequency of "focus on scores" converged at 70% to 82%, and involution became deadlocked. Therefore, to avoid educational involution, the frequency of "focus on scores" is preferably between 35% and 45%.

Key Words: youth education; involution; Prisoner's Dilemma; Reinforcement Learning; Q-learning

1. Introduction

With the continuous development of higher education in China, youth education has become widespread. In 2020, the consolidation rate of China's nine-year compulsory education reached 95.2%, and the gross enrollment rate of higher education reached 54.4% [1].

However, in recent years many problems have emerged in adolescent growth, among which mental health problems have become very serious: the Report on National Mental Health Development in China (2019-2020) shows that 24.6% of adolescents are depressed, of which 7.4% are severely depressed [2]. This problem has attracted widespread attention in society, and education has become an important entry point for improving the mental health of adolescents: on the one hand, most adolescents are students, who spend about 40 weeks in school each year; on the other hand, vicious competition among educational subjects has become a reason for the high rate of psychological problems among children. Vicious competition is mainly reflected in high academic pressure, which comes mainly from the dual burden of the amount and the difficulty of homework. The phenomenon of vicious competition in education resembles involution: when the outer boundary is fixed, development turns inward toward ever finer internal elaboration because creativity is exhausted [3][4]. To address the problems of adolescent growth, we can start from education, clarify the evolutionary laws and causes of educational involution, and then seek solutions.

At present, many experts and scholars have analyzed the causes of educational involution from four aspects: society, government, family and school.
They attribute the social causes mainly to the large gap between the rich and the poor, class solidification, and the uneven distribution of social resources; the governmental causes to the single education evaluation system, problems with the high school entrance examination, and the uneven distribution of educational resources; the school-related causes to limited teaching staff and problematic educational concepts; and the family-related causes to the high demand for education and problematic concepts of education [5-9]. Experts and scholars have proposed solutions based on these causes, but have not discussed how problems in the strategic interaction of educational subjects lead to the formation of educational involution.

Vicious competition among educational subjects is an important cause of educational involution. It indicates a problem in the way educational subjects interact strategically: the balance between the two strategies "focus on happiness" and "focus on scores" is broken, so that players gradually shift from "focus on happiness" to "focus on scores", yet every individual ends up with a lower "return effort ratio". The dilemma caused by this educational game is the same as that of the prisoner's dilemma [10]: individual rationality leads to collective irrationality. Therefore, this paper adopts a game-theoretic perspective, builds a two-player educational game based on the prisoner's dilemma, and analyzes its theoretical equilibrium and the causes of involution.

To test the theoretical analysis in this paper, a game experiment is needed. However, education concerns the growth of young people and the future of the country, so game experiments on it cannot be carried out lightly. At the same time, people in real life are only boundedly rational.
Therefore, reinforcement learning is considered for the simulation experiments. Reinforcement learning [11] has achieved excellent results in many practical applications, such as games, robot control, finance, medicine, resource scheduling optimization, and industrial process control. This paper therefore uses the Q-learning algorithm from reinforcement learning and regards the players in the educational game as agents that learn by trial and error in a simulated educational environment, so that the educational process can be studied through simulation. Taking a high frequency of "focus on scores" together with decreasing returns as the main sign of educational involution, this paper analyzes the evolution and causes of educational involution among Chinese youth, and explores the optimal strategy choice of boundedly rational players and their balance between "focus on happiness" and "focus on scores", in order to develop recommendations for improving education.

2. Game Analysis of Educational Involution from the Perspective of Prisoner's Dilemma

An important cause of educational involution is vicious competition among educational subjects, which indicates a problem in the way these subjects interact strategically. Educational involution can therefore be analyzed from the perspective of game theory. The outcome of the educational game is that each individual faces a lower "return effort ratio", which matches the essence of the prisoner's dilemma: individual rationality leads to collective irrationality. The prisoner's dilemma model can thus be used to analyze the evolution and causes of educational involution. Because the educational game in reality is particularly complex, the following five model assumptions are made to simplify the analysis and keep the educational game reasonable.

2.1 Model assumptions

Hypothesis 1: The parents, school and student involved in education are combined into a single rational player, the "home-school student".
Hypothesis 2: The short-term goal of each "home-school student" in the game is to maximize the score.
Hypothesis 3: Many rational players play against one another throughout education. To reduce the complexity of the model, two players, "Home School Student A" (hereinafter HSSA) and "Home School Student B" (hereinafter HSSB), are selected to construct the educational game.
Hypothesis 4: Since people gradually shift from "focus on happiness" to "focus on scores" in the educational game, HSSA and HSSB are each assumed to have two strategies: "Focus on Happiness" (hereinafter FH) and "Focus on Scores" (hereinafter FS).
Hypothesis 5: When both players choose FH, each obtains the return R; when both choose FS, each obtains the return P; when one chooses FH and the other FS, the player who chooses FH obtains S and the player who chooses FS obtains T. The four returns satisfy the following five inequalities:
Inequality 1: R > P. Both players obtain higher returns when both choose FH than when both choose FS.
Inequality 2: R > S. A player who chooses FH obtains more when the other player also chooses FH than when the other player chooses FS.
Inequality 3: 2R > S + T. The total return when both players choose FH is higher than when only one of them chooses FH.
Inequality 4: T > R. A player who is the only one to choose FS obtains the highest return.
Inequality 5: P > S. When one player chooses FS, the other player obtains a higher return by choosing FS than by choosing FH.
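
The five inequalities of Hypothesis 5 can be checked numerically. The sketch below is only an illustration: the paper does not state numerical values for R, S, T and P, so the classic Prisoner's Dilemma values are assumed, and the helper names (satisfies_hypothesis_5, payoff, best_response) are hypothetical. It also shows that Inequalities 4 and 5 together make FS the best response to either choice of the other player.

```python
# Assumed payoff values (NOT from the paper); classic Prisoner's Dilemma
# numbers are used purely for illustration.
R, S, T, P = 3.0, 0.0, 5.0, 1.0


def satisfies_hypothesis_5(R, S, T, P):
    """Map each inequality of Hypothesis 5 to whether it holds."""
    return {
        "1: R > P": R > P,
        "2: R > S": R > S,
        "3: 2R > S + T": 2 * R > S + T,
        "4: T > R": T > R,
        "5: P > S": P > S,
    }


def payoff(own, other, R, S, T, P):
    """Return of a player choosing `own` when the other player chooses `other`."""
    table = {
        ("FH", "FH"): R,
        ("FH", "FS"): S,
        ("FS", "FH"): T,
        ("FS", "FS"): P,
    }
    return table[(own, other)]


def best_response(other, R, S, T, P):
    """Strategy with the highest return against a fixed choice of the other player."""
    return max(("FH", "FS"), key=lambda own: payoff(own, other, R, S, T, P))


checks = satisfies_hypothesis_5(R, S, T, P)
responses = {other: best_response(other, R, S, T, P) for other in ("FH", "FS")}

print(checks)     # every inequality holds for the assumed values
print(responses)  # FS is the best response to both FH and FS
```

Any other values can be substituted for the assumed ones; the same check then reports which inequalities, if any, fail.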

2.2 Model establishment and theoretical analysis

Based on the above assumptions, the two-player education game model is constructed as shown in Table 1.

Table 1. Two-player education game model (each cell lists the returns of HSSA and HSSB, in that order)

HSSA \ HSSB    FH       FS
FH             (R,R)    (S,T)
FS             (T,S)    (P,P)

Table 1 shows that in the one-shot two-player education game, the Nash equilibrium is for both HSSA and HSSB to choose FS, with return P for each, whereas the Pareto-optimal outcome is for both to choose FH, with return R for each. In the one-shot game, the players can only reach the Nash equilibrium and cannot obtain the higher return R. When players choose FS, educational involution forms. The dilemma caused by this educational game is the same as that of the prisoner's dilemma: individual rationality leads to collective irrationality. Therefore, existing conclusions and research results on the prisoner's dilemma can be used to support the discussion of the educational game.

Real education is continuous, so it can be seen as a repeated educational game, similar to the repeated prisoner's dilemma. Theoretically, when the game is played repeatedly, HSSA and HSSB will realize that choosing FH yields greater returns, and mutual FH may then appear as an equilibrium outcome. When the game is repeated nearly infinitely many times, the equilibrium tends toward the Pareto optimum, shifting from choosing FS to choosing FH. On the other hand, if everyone chooses FH all the time and does not study at all, this is not conducive to rejuvenating the country through science and education or to national development. Therefore, balancing FH and FS is a key issue in education. To test how the theoretical analysis holds in practice, the following section explores the evolution law of the educational game with the Q-learning algorithm from reinforcement learning.

3. Exploration of Evolutionary Law of Educational Game Based on Reinforcement Learning

There are two reasons for choosing reinforcement learning to explore the evolution law of educational involution: (1) Education concerns the growth of adolescents and the future of the country, so game experiments cannot be done at will, whereas reinforcement learning can simulate the models described by game theory and carry out game experiments. (2) The game model assumes that the players are completely rational, but in real life people are usually boundedly rational: they cannot fully understand their environment or accurately predict the future, that is, information is not completely certain. An agent in reinforcement learning learns through trial and error under uncertainty and eventually acquires a strategy that maximizes its expected return. On this basis, to test the analysis of this paper, the Q-learning algorithm is used to simulate the educational game process and to explore the evolution law and causes of educational involution.

3.1 Experimental platform and experimental design

The reinforcement learning algorithm is implemented in Python 3.8 with the NumPy and pandas libraries, and the data are visualized with the Matplotlib library. In the experimental design, HSSA and HSSB both use the Q-learning algorithm to learn their strategies. The learning rate is set to α = 0.1, the discount factor to γ = 0.6, and the number of iterations to 10,000 rounds.
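
The experimental design above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: only α = 0.1, γ = 0.6 and the 10,000 rounds come from the paper, while the payoff values, the single-state formulation, the softmax (Boltzmann) action selection with temperature TAU = 1.0, the random seed, and all function and variable names are assumptions made here. The output is therefore not expected to match the curves reported in the figures.

```python
import numpy as np

ALPHA, GAMMA, ROUNDS = 0.1, 0.6, 10_000  # values stated in the paper
TAU = 1.0                                # assumed softmax temperature
R, S, T, P = 3.0, 0.0, 5.0, 1.0          # assumed payoffs (not from the paper)
FH, FS = 0, 1                            # action indices

# PAYOFF[own, other] is the return of the player choosing `own`.
PAYOFF = np.array([[R, S],
                   [T, P]])


def softmax_policy(q_values, tau=TAU):
    """Boltzmann action probabilities for one player's Q values."""
    z = (q_values - q_values.max()) / tau  # shift by the max for stability
    p = np.exp(z)
    return p / p.sum()


def simulate(rounds=ROUNDS, seed=0):
    """Play the repeated game with two independent Q-learners."""
    rng = np.random.default_rng(seed)
    q = np.zeros((2, 2))                   # q[player, action], one state assumed
    actions = np.zeros((rounds, 2), dtype=int)
    returns = np.zeros((rounds, 2))
    for t in range(rounds):
        # Each player samples an action from its own softmax policy.
        a = [rng.choice(2, p=softmax_policy(q[i])) for i in range(2)]
        r = [PAYOFF[a[0], a[1]], PAYOFF[a[1], a[0]]]
        for i in range(2):
            # Standard Q-learning update; the next state equals the current one.
            target = r[i] + GAMMA * q[i].max()
            q[i, a[i]] += ALPHA * (target - q[i, a[i]])
        actions[t] = a
        returns[t] = r
    return actions, returns, q


actions, returns, q = simulate()
fs_frequency = actions.mean(axis=0)    # share of rounds in which each player chose FS
average_return = returns.mean(axis=0)  # mean return per round for each player
print("FS frequency (HSSA, HSSB):", fs_frequency)
print("Average return (HSSA, HSSB):", average_return)
```

Running cumulative sums of `actions` and running means of `returns` give the kind of quantities plotted against the round number; the result depends on the assumed payoffs, temperature and seed.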
The Q-learning update of the state-action value function and the action selection probability are as follows:

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right] \tag{1}$$

$$P(a) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A} e^{Q(s,a')/\tau}} \tag{2}$$

Here α is the learning rate: the larger α is, the faster the Q value converges, but the more easily it oscillates. γ is the discount factor, which indicates how strongly future rewards influence the current action. In Eq. (2), A is the set of available actions and τ is the temperature parameter of the action selection. Next, the Q-learning algorithm is used to simulate the repeated two-player educational game model and to explore the evolutionary law of educational involution.

3.2 Analysis of experimental results

To analyze how HSSA and HSSB choose between FH and FS as the number of iterations increases, a line graph of the cumulative number of choices over 10,000 iterations of the two-player education game model is drawn in Figure 1.

Figure 1. Cumulative times of HSSA and HSSB choosing FH or FS

Figure 1 shows: (1) As the number of iterations increases, both HSSA and HSSB gradually choose FS, indicating that in youth education people tend toward FS over time. (2) Near 1800 iterations, the cumulative counts of the actions selected by HSSA and HSSB coincide, marking a demarcation point. Before 1800 rounds, the frequencies with which HSSA and HSSB chose FH and FS did not differ much, that is, there was no obvious preference between the two strategies. After that, HSSA and HSSB gradually chose FS.

Figure 2. Frequency for HSSA and HSSB selection of FH or FS

Figure 2 shows: (1) Before 1800 iterations, HSSA and HSSB chose FH more frequently than FS, and HSSA chose FH more frequently than HSSB.
(2) After 2500 iterations, the frequency with which HSSA chose FS gradually became higher than that of HSSB; after 8100 rounds, HSSA began to reduce its frequency of choosing FS while HSSB did the opposite. The frequency of HSSA choosing FS remained higher than that of HSSB, but the two showed a trend of convergence.

Figure 3. Average return of HSSA and HSSB

Figure 3 shows: (1) The average returns of HSSA and HSSB are highest before 1800 rounds, when they peak at 2.00 and 2.65, respectively. Combined with Figure 2, HSSA and HSSB select FS at this time with frequencies of 34% and 42%, respectively. (2) The average return of HSSB gradually decreased from round 2500 to round 8100 and was much lower than that of HSSA, whereas in the first 2500 rounds the average return of HSSB was higher than that of HSSA. This shows that a player who takes the lead in increasing the frequency of FS can obtain a higher average return than others in the short term, but after a long series of games that player's average return decreases rapidly and falls far below that of others. Therefore, in youth education, taking the lead in increasing the frequency of FS is detrimental both to others and to oneself.

A comprehensive analysis of Figures 1, 2 and 3 yields the following: (1) Involution goes through a process from brewing to formation, then intensification, and finally deadlock. (2) The specific evolution law of educational involution is as follows. In the first 1800 rounds, the frequency of players choosing FS increased slowly and stayed below 50%, the average return did not drop significantly, and involution was in the brewing period.
From round 1801 to round 2500, the frequency of both players choosing FS increased rapidly, the average return of the player with the higher frequency decreased significantly, and that player entered involution first; involution was in the formation stage. From round 2501 to round 8100, the frequency with which the first player to enter involution chose FS reached about 60% while that player's average return dropped to the bottom; the frequency of choosing FS and the average return of the later entrant then surpassed those of the first and rose slowly; involution was in the intense period. From round 8101 to round 10000, the frequency of players choosing FS converged between 70% and 82%, and the average return of the player who entered involution first increased, while the average return of the player who entered involution later decreased. A new round of involution shows signs of emerging, but since both players already choose FS very frequently, it is difficult for either to push further, and involution is deadlocked. (3) At the demarcation point where educational involution forms, the frequency of players choosing FS is 50%, which indicates that the natural balance between FH and FS is about to be broken. In the later stage, although the frequency of players choosing FS decreases for a short time, it remains above 50%. Therefore, a frequency of "focus on scores" above 50% among home-school students is one of the main causes of educational involution. (4) Combining the three figures, the average return of both players is largest in the stage before educational involution forms. With a frequency of FS of about 35% to 45%, the players obtain the best return and may avoid the formation of educational involution.

4. Summary and Outlook

The main contributions of this paper are as follows: (1) it theoretically analyzes the causes of educational involution among youth from the perspective of the Prisoner's Dilemma; (2) it uses the Q-learning algorithm from reinforcement learning to simulate an education game between boundedly rational players; (3) it analyzes the evolutionary laws and causes of educational involution and identifies an optimal strategy: "focus on scores" with a frequency of about 35% to 45%.

Future work can be considered from the following aspects: (1) Differentiating the players in the model. The two players in the education game are assumed by default to be identical, but in reality educational subjects differ individually. These individual differences can be analyzed specifically in later work, and an asymmetric educational game model can be constructed. (2) Exploring how long it will take for the state of educational involution to change after the country proposed the "double reduction" policy.

5. Acknowledgements

This work was financially supported by the National Natural Science Foundation of China (72031009) and Hubei Province Key Laboratory of Systems Science in Metallurgical Process (Y202105).

6. References

[1] China Youth Daily. Ministry of Education of the People's Republic of China: In 2020, the nine-year compulsory education consolidation rate was 95.2%, and the gross enrollment rate of higher education was 54.4% [DB/OL]. https://baijiahao.baidu.com/s?id=1693014680894029763&wfr=spider&for=pc, 2021-03-01 15:43.
[2] Xiaolan Fu, Kan Zhang, Xuefeng Chen, et al. Report on National Mental Health Development in China (2019-2020)[M]. Social Sciences Academic Press (China), 2021: 143-164.
[3] Goldenweiser A. Loose ends of theory on the individual, pattern, and involution in primitive society[J]. Essays in Anthropology, 1936: 99-104.
[4] Hong Wang, Zhi Chen. On the Logic and the Path of "Double Reduction" Policy through the Perspective of Involution[J]. Education & Economy, 2021, 37(06): 38-43+61.
[5] Zhujun Huang. On the Involution of Education in the Transition Period and Its Deciphering Path[J]. Journal of East China Normal University (Educational Sciences), 2012, 30(02): 37-41+47.
[6] Rong Mao. Universal Higher Education and Justice as Fairness[J]. Jiangsu Higher Education, 2021(08): 1-6.
[7] Xiong Yang. The Roots and Cracks of "Educational Involution" in the AI Era[J]. Social Sciences Digest, 2021(11): 4-6.
[8] Youhua Chen, Guo Miao. Enrollment Tournament, Educational Involution and Stratification of School District[J]. Journal of Jiangsu Administration Institute, 2021(03): 55-63.
[9] Cheng Chen, Lei Bao. The Origin Involution and Solutions to Address Involution in Education[J]. China Examinations, 2022(02): 81-88.
[10] Axelrod R, Hamilton W D. The evolution of cooperation[J]. Science, 1981, 211(4489): 1390-1396.
[11] Sutton R S, Barto A G. Reinforcement learning: An introduction[M]. MIT Press, 2018: 1-22.