Construction and Application of Subject Knowledge Graph for Basic Education

Tao Xu1,2,3, Xiaqing Ma1,2, Fengsi Wang3,4, Chun Liu2,4, Zixiang Zhang1,2, Peiming Lu2, Daojun Han2,*

1 Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng, 475004, China
2 School of Computer and Information Engineering, Henan University, Kaifeng, 475004, China
3 Henan Technology Innovation Center of Spatio-Temporal Big Data, Henan University, Zhengzhou, 450046, China
4 Henan Industrial Technology Academy of Spatio-Temporal Big Data, Henan University, Zhengzhou, 450046, China

Abstract
The chapter-based distribution of knowledge points in the current knowledge system can no longer meet the needs of students and teachers. To address this problem, this paper proposes using SVM to identify the knowledge points of tests, building a subject Knowledge Graph from the identified knowledge points, and designing and implementing a subject Knowledge Graph system. The mechanical movement part of the first chapter of the eighth grade (upper) physics course is taken as an example for empirical research. Under 10-fold cross-validation, the average F1 value of the algorithm used in this paper is 89.66%.

Keywords
Knowledge system, SVM, knowledge point information, subject Knowledge Graph, system

1. Introduction
In existing teaching resources, the knowledge points of the knowledge system are organized by chapter level. However, academic assessment is fiercely competitive: the content of tests grows more difficult day by day and increasingly focuses on the fusion of knowledge points. A knowledge system organized by chapter level therefore can no longer meet the needs of students' learning and teachers' teaching. This paper argues that a Knowledge Graph can show the complex correlations between knowledge points and thus meets the needs of students and teachers better.
Knowledge Graphs, as a form of structured human knowledge, have attracted great attention from both academia and industry [1]. A Knowledge Graph describes the relationships between real-world entities in the form of triples. Because of this expressive form, the subject Knowledge Graph has become a research hotspot for subject knowledge systems in recent years. Chen et al. used TextRank to extract knowledge points from course introductions to construct a Knowledge Graph [2]. Cheng et al. used TF-IDF to retrieve teaching courseware and mine knowledge points to build a Knowledge Graph [3]. Su et al. evaluated the relationship between knowledge points by calculating the semantic similarity, PMI, and normalized Google distance between knowledge points [4]. Unlike these studies, this paper holds that the relationships between knowledge points in tests better show the complex associations between knowledge points. To address the problem described above, this paper proposes using the SVM [5] algorithm to identify the knowledge points of tests, building the subject Knowledge Graph from the identified knowledge point information, and designing and implementing a subject Knowledge Graph system based on test big data.

ISCIPT2022@7th International Conference on Computer and Information Processing Technology, August 5-7, 2022, Shenyang, China
EMAIL: txu@henu.edu.cn (Tao Xu); Maxq@henu.edu.cn (Xiaqing Ma); xxgcwfs@163.com (Fengsi Wang); liuchun@henu.edu.cn (Chun Liu); zzx1289@henu.edu.cn (Zixiang Zhang); 978726425@qq.com (Peiming Lu)
* Corresponding author's email: hdj@henu.edu.cn (Daojun Han)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
An empirical study on the first chapter of the grade 8 junior middle school physics course covers knowledge point identification, construction of the subject Knowledge Graph, knowledge point query, and construction of the subject Knowledge Graph system. Both the knowledge point recognition method and the constructed subject Knowledge Graph are evaluated.

2. Technical framework

2.1. Technology route
This paper aims to identify the knowledge points of tests, to construct the subject Knowledge Graph from the teaching materials and the tests, and to query knowledge points through the subject Knowledge Graph. The technology route is divided into four steps. First, data acquisition is performed: chapter, section, and knowledge point information is extracted manually from the textbook, and the tests are manually labeled with knowledge points according to the extracted information. Second, the knowledge point recognition algorithm is implemented: the tests are preprocessed, their features are extracted by the TF-IDF [6] method, and their knowledge points are recognized by the SVM classification algorithm. Third, the subject Knowledge Graph is constructed from the chapter, section, and knowledge point information extracted from the textbook and the tests. Finally, the subject Knowledge Graph is used to query knowledge points.

2.2. Identification of knowledge points in tests
Identifying the knowledge points in the tests takes three steps. First, the tests data is preprocessed. Second, the TF-IDF method is used to extract the features of the tests data. Finally, the SVM classification algorithm is used to identify the knowledge points of the tests.

2.2.1. Preprocessing of test data
To improve the accuracy of tests identification, the data must be preprocessed.
The preprocessing of tests data consists of tests data modeling and construction of the tests word frequency matrix.

(1) Tests data modeling
First, the pictures in the tests are removed to obtain the tests data as plain text. When modeling the tests data, attributes such as "chapter", "test type", and "knowledge point" are introduced to create the tests data model. That is, the tests data can be defined as $\mathbb{Q} = \langle q_i \mid 0 < i \le n \rangle$, where $q_i = \langle W, P, c, l, k \rangle$ represents the $i$-th test, $W$ represents the test data after the pictures are removed, $P$ represents the word segmentation result, $c$ represents the chapter to which the test belongs, $l$ represents the type of the test, $k$ represents the knowledge point of the test, and $n$ represents the number of tests.

(2) Constructing tests word frequency matrix
Jieba determines the association probability between Chinese characters through a Chinese thesaurus and groups characters with a high association probability into phrases, which form the word segmentation result. This paper uses Jieba to segment each test to obtain $P_s$. After word segmentation, the results are cleaned: irrelevant phrases $P_a$, such as prepositions and adjectives, are screened out, so that $q_i.P = P_s - P_a$. The complete set of phrases after segmenting and screening all tests data is denoted $p_s = \{p_j \mid 0 < j \le m\}$, where $m$ is the number of phrases and $p_j$ is the $j$-th phrase in the complete set. A word frequency matrix $M = \langle f_{ij} \rangle$ is then constructed from the complete set of phrases, where $f_{ij}$ indicates whether the $j$-th phrase $p_j$ appears in the $i$-th test.

2.2.2. TF-IDF method to extract tests features
Since the word frequency matrix obtained from preprocessing still contains a large number of phrases, this paper uses the TF-IDF method to extract the features of the tests and optimize the word frequency matrix.
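The binary word frequency matrix from Section 2.2.1 can be illustrated with a minimal sketch. It assumes the tests are already segmented; short English token lists stand in for Jieba output on Chinese text, and the stop list `IRRELEVANT` as well as all function names are hypothetical choices made only for this illustration, not the authors' implementation.

```python
from typing import Dict, List, Set

# Hypothetical stop list standing in for the screened-out phrases P_a.
IRRELEVANT: Set[str] = {"the", "of", "a", "is", "in"}


def screen(tokens: List[str]) -> List[str]:
    """Drop irrelevant phrases, i.e. compute P = P_s - P_a for one test."""
    return [t for t in tokens if t not in IRRELEVANT]


def build_vocabulary(tests: List[List[str]]) -> List[str]:
    """Collect the complete set of phrases p_s over all tests, in first-seen order."""
    vocab: Dict[str, None] = {}
    for tokens in tests:
        for token in tokens:
            vocab.setdefault(token, None)
    return list(vocab)


def word_frequency_matrix(tests: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Return rows with f_ij = 1 if the j-th phrase appears in the i-th test, else 0."""
    rows = []
    for tokens in tests:
        present = set(tokens)
        rows.append([1 if phrase in present else 0 for phrase in vocab])
    return rows


# Toy, already-segmented tests (assumption: segmentation happened upstream).
segmented = [
    ["the", "speed", "of", "a", "car", "is", "uniform"],
    ["reference", "object", "of", "the", "car"],
]
screened = [screen(tokens) for tokens in segmented]
vocab = build_vocabulary(screened)
matrix = word_frequency_matrix(screened, vocab)
```

Each row of `matrix` corresponds to one test and each column to one phrase of `vocab`, so the matrix grows with the vocabulary size, which is why a feature selection step is applied afterwards.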
The TF-IDF method is often used to evaluate the importance of a word or phrase to a document in a corpus or to the document set as a whole. This paper therefore uses the TF-IDF method to extract the more important feature words from the tests corresponding to each knowledge point. TF-IDF consists of two parts, TF and IDF. TF is the word frequency, which indicates how often the word or phrase appears in the tests corresponding to the knowledge point. TF is calculated as shown in equation (1).

$$TF = \frac{d_c}{d_s}, \tag{1}$$

where $d_c$ is the number of tests, among those corresponding to the knowledge point, in which the word or phrase appears, and $d_s$ is the total number of tests corresponding to the knowledge point. IDF is the inverse document frequency; its value decreases as the word or phrase becomes more frequent in the tests bank. IDF is calculated as shown in equation (2).

$$IDF = \log\left(\frac{T_s}{d_c + 1}\right), \tag{2}$$

where $T_s$ is the total number of tests in the tests bank. To avoid a zero denominator, 1 is added to the denominator. TF-IDF is then calculated as shown in equation (3).

$$TF\text{-}IDF = TF \times IDF, \tag{3}$$

2.2.3. SVM classification algorithm
In this paper, the SVM algorithm performs a binary classification for each knowledge point label, which realizes the identification of the knowledge points in the tests. The basic idea of the SVM algorithm is to find the hyperplane that correctly divides the training set and has the largest margin. The steps to solve for the SVM hyperplane are as follows:

① Define the training set $D = \{(x_1, y_1), (x_2, y_2), \cdots, (x_M, y_M)\}$, where each $x_i$ is an $n$-dimensional feature vector and each $y_i$ takes the value 1 or -1.
Then select the kernel function $H(x, z)$ and the penalty coefficient $C$, and compute the hyperplane with the largest margin, so that the points closest to the hyperplane are as far from it as possible. Next, solve the optimization problem with the Lagrangian method; the result is shown in equations (4)-(6), and the solution obtained for the vector $\alpha$ is denoted $\alpha^*$.

$$\min_{\alpha} \left\{ \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i \alpha_j y_i y_j H(x_i, x_j) - \sum_{i=1}^{M} \alpha_i \right\}, \tag{4}$$

$$\text{s.t.} \quad \sum_{i=1}^{M} \alpha_i y_i = 0, \tag{5}$$

$$0 \le \alpha_i \le C, \tag{6}$$

where $\alpha$ is the Lagrange multiplier vector and $C$ represents the degree of penalty for misclassified points. The larger the value of $C$, the greater the penalty for misclassified points.

② For the $\alpha^*$ calculated in ①, select a component $\alpha_j^*$ with $0 < \alpha_j^* < C$ and calculate $b^*$ as shown in equation (7).

$$b^* = y_j - \sum_{i=1}^{M} \alpha_i^* y_i H(x_i, x_j), \tag{7}$$

③ Finally, the classification decision function of the SVM is obtained as shown in equation (8), and the classification hyperplane of the SVM is shown in equation (9).

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{M} \alpha_i^* y_i H(x_i, x) + b^* \right), \tag{8}$$

$$\sum_{i=1}^{M} \alpha_i^* y_i H(x_i, x) + b^* = 0, \tag{9}$$

2.3. Subject Knowledge Graph construction
The subject Knowledge Graph constructed in this paper defines three concepts: "chapter" $Z$, "section" $S$, and "knowledge point" $K$. The graph is defined as $\mathbb{Z} = \langle h, r, t \rangle = \langle \langle Z, P_1, S \rangle, \langle S, P_2, K \rangle, \langle K, P_3, K \rangle \rangle$, where $h$ represents the head entity, $r$ represents the relationship, $t$ represents the tail entity, $P_1$ represents the chapter-level relationship between chapters and sections, $P_2$ represents the chapter-level relationship between sections and knowledge points, and $P_3$ represents the relationship between knowledge points.
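The triple schema defined above can be made concrete with a small sketch. It only encodes the three allowed patterns (chapter, P1, section), (section, P2, knowledge point), and (knowledge point, P3, knowledge point); the class names, the `SCHEMA` constant, and the example entity names are hypothetical and chosen for illustration, not taken from the authors' system.

```python
from dataclasses import dataclass

# Allowed (head kind, relation, tail kind) patterns from the definition:
# <Z, P1, S>, <S, P2, K>, <K, P3, K>.
SCHEMA = {
    ("chapter", "P1", "section"),
    ("section", "P2", "knowledge_point"),
    ("knowledge_point", "P3", "knowledge_point"),
}


@dataclass(frozen=True)
class Entity:
    kind: str  # "chapter", "section", or "knowledge_point"
    name: str


@dataclass(frozen=True)
class Triple:
    head: Entity
    relation: str
    tail: Entity

    def is_valid(self) -> bool:
        """Check the triple against the three patterns of the schema."""
        return (self.head.kind, self.relation, self.tail.kind) in SCHEMA


# Illustrative entities only; the names are assumptions, not the paper's data.
chapter = Entity("chapter", "Mechanical movement")
section = Entity("section", "Example section")
speed = Entity("knowledge_point", "speed")
average_speed = Entity("knowledge_point", "average speed")

triples = [
    Triple(chapter, "P1", section),
    Triple(section, "P2", speed),
    Triple(speed, "P3", average_speed),
]
all_valid = all(t.is_valid() for t in triples)
```

Validating triples against a fixed schema before storing them is one simple way to keep the chapter hierarchy and the knowledge point relations from being mixed up; a triple such as (chapter, P3, knowledge point) would be rejected.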
In this paper, the entities "chapter", "section", and "knowledge point" are extracted manually from the textbook, and the relationships between "chapter" and "section" and between "section" and "knowledge point" are defined according to the chapter hierarchy of the knowledge points in the textbook. The test knowledge point identification algorithm proposed in this paper yields the knowledge point information of the tests: the co-occurrence of knowledge points and the inspection frequency of knowledge points. From this information, the dependence between knowledge points and the strength of each knowledge point in the test paper are judged. Finally, the chapters, sections, knowledge points, and their relationships are stored in the Neo4j graph database in the form of triples. The thickness of a relationship edge indicates the strength of the relationship between knowledge points, and the size of a node indicates the degree of the knowledge point.

2.4. Knowledge point query
To query a knowledge point, first connect to the Neo4j database, then check whether the knowledge point $e_1$ exists in the database. If it exists, query the relevant information in which $e_1$ is the head entity or the tail entity, and build the Knowledge Graph $G(e_1)$ and the relation list $L(e_1)$ related to the knowledge point. If it does not exist, output "the entity has not been added to the database yet".

3. Application case studies

3.1. Data sources
This paper takes the mechanical movement part of the first chapter of the eighth grade (upper) physics course as an example for empirical research. The data come from zxxk.com (Subject Network), an authoritative domestic resource website for primary and secondary education. A total of 13,740 tests (sample tests and simulated tests) on the mechanical movement part of the first chapter of the eighth grade (upper) physics course were collected.

3.2. Evaluation of knowledge point recognition technology of tests
In this paper, SVM classification is performed separately for each knowledge point, which realizes multi-label classification of the knowledge points in the tests. First, the tests data is preprocessed: the Chinese word segmentation component Jieba segments the tests data, irrelevant phrases such as adjectives and prepositions are filtered out, and the tests word frequency matrix is constructed. Then, the TF-IDF method is used to extract tests features that completely cover all the tests of each knowledge point. For the tests of different knowledge points, the word frequency matrix is optimized with the tests features of all knowledge points. The SVM model is trained on the optimized word frequency matrix to classify a single knowledge point in the tests, which finally realizes the identification of the knowledge points in the tests. As shown in Figure 1, the SVM model was optimized by adjusting the proportion of the training set, and the model performance was evaluated by the F1 value on the test set.

Figure 1: The change trend of F1 value with the proportion of training set

The results show that the SVM model identifies the knowledge points of the tests best when the ratio of training set to test set is 9:1, with an F1 value of 95.50%. To further verify the validity of the model, ten-fold cross-validation was performed. Figure 2 shows the F1 value of the model on the different test sets. The average F1 value is 89.66%, which indicates that the SVM algorithm used in this paper can effectively identify the knowledge points in the test set.

Figure 2: F1 Value of the model on different test sets

3.3. Subject Knowledge Graph construction evaluation
The subject Knowledge Graph is constructed from the chapter hierarchy of the knowledge points in the textbook, combined with the knowledge point information identified by the test knowledge point recognition algorithm. The subject Knowledge Graph constructed in this paper contains 17 nodes (1 "chapter" node, 4 "section" nodes, and 12 "knowledge point" nodes) and the relationships between them. As shown in Figure 3, the Knowledge Graph not only retains the chapter hierarchy of the knowledge points, but also includes the frequency of each knowledge point in the test paper and the degree of relevance between knowledge points in the test paper.

Figure 3: Subject Knowledge Graph

3.4. Subject Knowledge Graph interface design and analysis
Following the method proposed in this paper, a subject Knowledge Graph system is constructed. The system includes three functional interfaces: test knowledge point recognition, Knowledge Graph display, and knowledge point query. In the test knowledge point identification interface, the user enters a test, and the system displays the time of identification, the identified test, and the identification result, so that users can study according to the recognition results. In the Knowledge Graph display interface, the user can clearly observe the distribution of knowledge points across chapters, their inspection frequency in the test paper, and the dependencies between knowledge points in the tests. In the knowledge point query interface, the user enters a knowledge point name, and the system displays the relevant information about the knowledge point as a relationship diagram and a relationship table, which makes it convenient to study a particular knowledge point.

4. Summary and Outlook
This work addresses the problem that the knowledge points in the current knowledge system are distributed only by chapter level.
This paper uses the SVM algorithm to identify the knowledge points of the tests. Under 10-fold cross-validation, the average F1 value of the algorithm is 89.66%, so it can effectively identify the knowledge points in the test set. The subject Knowledge Graph is constructed from the identified knowledge point information and the knowledge points extracted manually at the textbook chapter level. Following the proposed method, this paper designed and implemented the subject Knowledge Graph system with three interfaces: test knowledge point identification, Knowledge Graph display, and knowledge point query. Future work will proceed in two directions: optimizing the method for identifying knowledge points in tests, and adding functions such as question-and-answer search and test recommendation to the system.

5. Acknowledgements
This work was supported by the Educational Science Planning of Henan Province, China under Grant [2021YB0037]; the Open Fund of Scientific Research Laboratory of Henan University for undergraduates under Grant [No.71].

6. References
[1] Ji S, Pan S, Cambria E, et al. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications[J]. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[2] Xi C, Guang M, Jinjin Z, et al. A Method for Predicting Student Performance Combining Knowledge Graph and Collaborative Filtering[J]. Computer Application, 2020, 40(02): 595-601.
[3] Ping C, Xun F. The Teaching Research of MPAcc Course Based on Knowledge Graph under the Background of "Golden Course" Construction: Taking the Course of "Cloud Accounting and Intelligent Financial Sharing" of Chongqing University of Technology as an Example[J]. Accounting Communications, 2019, (28): 35-38.
[4] Yong S, Yong Z. Automatic Construction of Subject Knowledge Graph Based on Educational Big Data[C]//Proceedings of the 2020 The 3rd International Conference on Big Data and Education. 2020: 30-36.
[5] Cortes C, Vapnik V. Support-vector Networks[J]. Machine Learning, 1995, 20(3): 273-297.
[6] Qaiser S, Ali R. Text mining: use of TF-IDF to examine the relevance of words to documents[J]. International Journal of Computer Applications, 2018, 181(1): 25-29.