=Paper=
{{Paper
|id=Vol-3353/paper3
|storemode=property
|title=Support System for Exchange Students for the Validation of Courses using Text Mining
|pdfUrl=https://ceur-ws.org/Vol-3353/paper3.pdf
|volume=Vol-3353
|authors=Solange E. Barreda-Muñoz,Sebastian E. Quiroz-Cervantes,Jorge L. Martínez-Muñoz,Jose A. Sulla-Torres
|dblpUrl=https://dblp.org/rec/conf/citie/Barreda-MunozQM22
}}
==Support System for Exchange Students for the Validation of Courses using Text Mining==
Solange E. Barreda Muñoz, Sebastian E. Quiroz Cervantes, Jorge L. Martínez Muñoz and Jose A. Sulla-Torres
Universidad Católica de Santa María, Urb. San José s/n Umacollo, Arequipa, 04000, Perú

Abstract
Universities usually have a service to carry out student exchanges through the agreements established between the institutions. This service follows specific steps, including the review of syllabi, which generally takes a long time and causes inconvenience. For this reason, an alternative solution is proposed that uses text mining to compare the syllabus of each course and thus ensure an optimal and effective validation of courses. The proposal has been based on the Rational Unified Process, following all its phases, and on the R language for its implementation, with a case study of a university in the city of Arequipa, Peru. The results showed that it was possible to find the most frequent and related words in the similarity of the syllabus documents. It is therefore concluded that the contribution provided by text mining helped improve the process of validating courses for student exchange between institutions.

Keywords
Text mining, student exchange, syllabus comparison, assessment, text analysis.

1. Introduction

According to data from Peruvian universities, many undergraduate and postgraduate students carry out an exchange at a foreign university each year through an established agreement. This program has more than 30 destination countries. It has a series of requirements, among them the validation of the syllabi of the courses to be studied, so that students have the opportunity of an international experience [1]. The efficiency of the academic and administrative procedures of higher education marks the competitive advantage in quality; in that sense, the treatment of the study plans (syllabi) is in many cases done manually, delaying many educational processes [2].

The problem arises when, during the exchange process, the student goes to the destination university and has to choose courses by name alone, which does not guarantee similarity between them; this is ineffective, since the name does not ensure that the content of one course is the same as, or very similar to, that of another. This makes the experience less attractive since, among other objectives, the student wants to learn more by living an experience abroad, not to deal with bureaucratic or cumbersome procedures. The exchange process is often not automated; it involves several administrative areas, such as the Cooperation and International Relations Office, the Directorate of the Professional School, and the document-reception desk (Mesa de Partes). As a result, the process takes more than a week because of other internal procedures required to deliver the complete documents; for example, one of the requirements is a psychological evaluation. On the other hand, for the universities it is a situation that is expected to be feasible, since they are the ones that must accept the exchange and communicate it both to the respective offices and to the student. Likewise, the Cooperation and International Relations Office must inform the School Director about the student's exchange before the trip.
On the other hand, text mining is increasingly helpful in the treatment of documentation in different fields and in the handling of information in documentary procedures. The text mining methodology can be used for educational tasks [3]. The proposal aims to help students compare syllabi through text mining, which, through a keyword-frequency approach, supports a variety of operations in Knowledge Discovery in Textual Databases (KDT), providing a suitable basis for knowledge discovery and exploration over collections of unstructured text. This research also pursues the construction of the WebApp called SwapMe to contribute to the automation and improvement of the university exchange system.

2. Literature review

Among the different investigations on the subject, the article on text mining in education [4] was reviewed, which discusses the growth of online education and the resulting challenges of extracting text data to find knowledge valuable to those interested in education. This paper helps us see how text mining is used and how it can be applied in the educational environment. Its main objective is to answer three research questions: What are the most used text-mining techniques in educational settings? What are the most used educational resources? And what are the main applications or educational objectives?

In the paper by Feldman, Dagan, and Hirsh [5] on text mining using keyword distributions, they describe the KDT system for knowledge discovery in text, in which keywords tag documents and knowledge discovery is done by analyzing the frequencies of co-occurrence of the various keywords that tag the documents.

Kadampur and Riyaee [6] present a web-enabled, AI-powered, personalized service application for educators to automatically configure a quiz on a selected topic or syllabus. The AI component of the app relies on text mining and content classification. It also helps suggest questions about the content of the supplied text.

In the article by Ito et al. [7], they propose a project-based learning educational system that uses artificial intelligence instead of direct instruction from teachers. For this, they used the e-Syllabus that teachers have employed as a communication tool with students. Some have been using the e-Syllabus by carrying out flipped and active learning [8]. They consider text mining vital, as it transforms unstructured text into structured words and extracts meaningful patterns, making it possible to explore and discover new meanings in the text data.

De Aires Angelino, Loureiro, and Bilro [9] explored how student engagement can be promoted through transmedia, using a set of activities within the Moodle learning management system for a syllabus topic on innovation over an entire semester. To perform the data analysis, they followed a mixed-methods approach combining descriptive statistics, data mining analysis based on the Orange open-source software, and a final questionnaire.

Kaibassova et al. [10] review the use of intellectual data analysis methods to form educational programs, in the context of determining the sequence in which the disciplines of the program under consideration should be studied.
The article succinctly describes the developed software application, which allows extracting information from text documents and processing, analyzing, and visualizing the data. The proposed model groups text documents, taking into account the weighting coefficient of the individual words of the corpus.

Kawintiranon et al. [11] indicate that curriculum analysis is attracting widespread interest in the educational field. They point out two main approaches: (i) human-based and (ii) text-based assessments. They present an automated text-based curriculum analysis that directly assesses complete course materials. The approach employs a well-known text mining technique that extracts keywords using TF-IDF. The analysis is based on the keywords in the course materials that match the keywords in online documents, which play the role of the domain expert. Another similar study was carried out in [12], and also for curriculum validation [13].

Yasukawa, Yokouchi, and Yamazaki [14] investigated the searchability of a collection of curricula and compared methods for word suggestions using deep learning approaches and large text corpora. In the experiment, they used a bibliographic database from university libraries in Japan. The results indicated that a wide range of vocabulary is advantageous in improving the searchability of syllabi.

Khan and Choi [15] present an efficient text-mining method that focuses on extracting and updating unknown words to improve data classification and POS tagging. The system's main feature is finding such unknown foreign words and updating them to the appropriate words, which depends on the information available in the dictionaries. The proposed methods can also help improve the accuracy of extracting frequent patterns and association rules from unstructured (textual) data.

3. Methodology

The project is based on the Rational Unified Process (RUP) methodology [16]. According to the project's characteristics, the participants' roles, the activities to be carried out, and the artifacts (deliverables) were selected. The four phases of the methodology, shown in Figure 1, will be fulfilled: the initiation and elaboration phases with one iteration, three iterations of the construction phase, and two iterations of the transition phase.

Figure 1: Phases of the RUP Method

3.1. Phase 1: Design

In this initiation phase, the workflows are presented as the main requirements (see Table 1), to define and agree on the project's scope with the interested parties and to identify the possible risks for the project, shown in Table 2.
Table 1: Main Requirements
- Authentication: It will allow the user to enter the System through an authentication method.
- Visualization of Agreements: The application will show the agreements that the University has.
- Visualization of Universities: The System will allow the user to see the universities in which the exchange can be carried out.
- University selection: The user can select the University where the exchange is to be done.
- Career information display: After selecting the University, the career is selected and its study plan can be seen.
- Course Comparison: Once the destination university has been selected, the course comparison process can be carried out.
- Text Mining: These techniques will be used for the development of the comparison process.
- Comments and ratings: Universities will have a comments section where students talk about their experiences and give an assessment of their exchange.
- Change of language: The System will be able to change the language depending on the user's preference.

Table 2: Risks external to the Project (Risk ID, Description, Risk)
- R01: Resistance to change. Risk: scope not accepted by the client.
- R02: Information obtained is out of date. Risk: quality failures in the System.
- R03: The end user does not understand the operation of the System. Risk: scope not accepted by the client.
- R04: Lack of professional career curricula. Risk: without the syllabi, the System will not be able to function.

3.2. Phase 2: Elaboration

In the elaboration phase, we focus on the design and analysis aspects to verify that the project is viable, to establish the technologies that will be used, and, as the main result, to obtain a stable architecture.

3.2.1. Technology

The most important technology for elaborating the proposal is text mining, treated as a life cycle [17]. Figure 2 shows the life cycle of how it is applied to the comparison of syllabi.

Figure 2: Text Mining Lifecycle

3.2.2. Architecture

The logical architecture is shown in Figure 3; it is composed of the following layers, whose most important aspects are described below.

Figure 3: Logical architecture of the proposal.

1. User Layer: This layer contains the actors that interact with the System; in the architecture, they are students and administrators. Both can access the System, but each has different functionalities available.
2. Presentation Layer: This layer presents the technologies to be used for the development of the application, in this case the classic web technologies HTML, CSS, and JavaScript. They are implemented with VSCode, the selected text editor, which facilitates development thanks to its intelligent autocompletion and extensions such as Live Server, which lets us visualize every change as it is made.
3. Web Service Layer: This layer holds the various web services that provide a secure way to access the System and the databases.
4. Business Layer: This layer contains all the processes users can execute, from entering the System and validating their credentials to requesting the comparison of syllabi between different universities (for students) and, in the case of administrators, adding new universities, careers, and other options.
5. Text Mining Layer: In this layer, the text mining process is carried out to verify the compatibility between careers. The syllabi to be compared are first extracted and then cleaned, that is, words such as articles, punctuation marks, and connectors are eliminated in order to obtain the keywords; clustering and classification processes [18] are then applied to obtain a percentage of similarity, ensuring that the same topics are covered in both universities and that the student can carry out the exchange with more confidence.
6. Data Layer: This layer contains the databases needed for the correct functioning of the application: the University's database and the databases of the other universities with which exchanges can be carried out.

3.3. Phase 3: Construction

In this phase, the priority is to reach the operational capacity of the product incrementally through iterations; we are dedicated to mitigating all the risks found during development and to implementing all the characteristics and requirements previously described, in order to obtain a version of the WebApp called SwapMe. Figure 4 shows the class diagram developed.

Figure 4: Class Diagram.

The materials used were the RStudio tool [19] with its R language, a free and open-source multiplatform programming environment. R was specifically designed for statistical analysis, which makes it well-suited for data science applications, and the tools it provides make it easy to perform robust text analysis. Shiny is a popular R package that makes it easy to create interactive web applications directly from R: stand-alone applications can be hosted on a web page, embedded in R Markdown documents, or turned into dashboards. The database was managed with SQL Server 2017 to administer student data, teachers, courses, the syllabi of each course, and University agreements.
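To give an idea of how a Shiny interface for the comparison step can be put together, the following is a minimal sketch under illustrative assumptions: the university names, the course name, and the compare_syllabi() helper are hypothetical placeholders and do not correspond to the actual SwapMe source code.

<pre>
# Minimal Shiny sketch of a syllabus-comparison screen (illustrative, not the SwapMe source).
# compare_syllabi() is a hypothetical placeholder for the text-mining step of Section 3.3.1.
library(shiny)

compare_syllabi <- function(university, course) {
  # A real implementation would load both syllabi, clean them, build a
  # document-term matrix, and return a cosine-similarity percentage.
  round(runif(1, 50, 100), 1)
}

ui <- fluidPage(
  titlePanel("SwapMe - course comparison (sketch)"),
  selectInput("university", "Destination university", c("University A", "University B")),
  textInput("course", "Course name", "Artificial Intelligence"),
  actionButton("go", "Compare"),
  verbatimTextOutput("result")
)

server <- function(input, output) {
  output$result <- renderPrint({
    input$go  # re-run when the button is pressed
    isolate(cat("Estimated similarity:",
                compare_syllabi(input$university, input$course), "%"))
  })
}

shinyApp(ui, server)
</pre>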
3.3.1. Syllabus Text Mining

In order to extract information about the courses from the documents provided by the universities, we use text mining. To carry out the comparison process, we used the R programming language, which has a statistical-analysis approach oriented towards business intelligence and provides algorithms for data mining. Text analysis, in particular, is well established in R: there is a vast collection of libraries devoted to text processing and analysis, from low-level string operations to advanced text modeling techniques such as the fitting of allocation models. One of the main advantages of performing text analysis in R is that it is usually accessible and relatively uncomplicated to combine packages or libraries; the complexity of the analysis lies in interpreting each of the calculations performed on the imported data. This challenge plays an increasingly important role for developers in terms of cooperation and coordination. Recent work in the R text-analytics developer community aims to promote interoperability between packages, which increases implementation flexibility and makes it easier to learn the basic concepts of parsing texts while providing access to a wide range of advanced functionality. Table 3 shows the structure of the operations in the R packages to be used.

Table 3: Structure of package operations in R (Operation, Main library, Other options)
Data preparation:
- Import of texts: readtext (txt, readxl, pdftools)
- String operations: stringi (stringr)
- Preprocessing: quanteda (stringi, tokenizers, snowballC, tm)
- DTM: quanteda (tm, tidytext, Matrix)
- Filtering and weighting: quanteda (tm, tidytext, Matrix)
Analysis:
- Dictionary: quanteda (tm, tidytext)
- Supervised machine learning: quanteda (RTextTools, kerasR)
- Unsupervised machine learning: topicmodels (quanteda, stm, text2vec)

The table above presents the operations in the sequence in which the texts should be analyzed. For a better understanding, Figure 5 shows the diagram of the synthesized steps of the analysis of the syllabi (sillabus.txt, corpus in R, tokens, lemmatization, results).

Figure 5: Stages of syllabus analysis.

First, we chose two syllabi from different universities with similar course names in order to analyze their content. Before comparing the text of both syllabi, it is necessary to carry out the first step of text mining: data preprocessing. We need to install a few packages in RStudio, the tool we use to develop with the R language. After installing these packages, we define the folder from which the text will be read and import the packages into the created project; each package has a different functionality:
- tm: specific to text mining.
- wordcloud: to plot the word cloud that we will see later.
- dplyr: functions to manipulate and transform the data; it allows changing or deleting operators so that the text is more readable.
- readr: allows reading and writing documents.
- cluster: to perform cluster analysis.

Once the file is read, we define from which block of lines the document will be read; we are left with as many elements as lines read in the document and the selected range, and we combine the column number with the content of each column. It is also necessary to clean the data: converting everything to lowercase, eliminating empty words, eliminating the numbers (which in this case are not necessary), and finally removing the blank spaces. We convert everything to plain text and draw the word cloud [20] shown in Figure 6. It is necessary to eliminate specific words that are repeated so that the words shown at the end are essential and carry more meaning. For example, the word "practice" or "phase" is not relevant in this case, so we eliminate it and redraw the word cloud.

Figure 6: Generated word cloud.
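A minimal sketch of this cleaning and word-cloud step is shown below. It assumes a syllabus already exported to a plain-text file; the file name and the extra domain words removed are illustrative, and only standard tm and wordcloud functions are used.

<pre>
# Sketch of the cleaning steps described above, assuming a syllabus exported to "sillabus.txt".
library(tm)
library(wordcloud)

raw  <- readLines("sillabus.txt", encoding = "UTF-8")
corp <- VCorpus(VectorSource(paste(raw, collapse = " ")))

corp <- tm_map(corp, content_transformer(tolower))        # convert to lowercase
corp <- tm_map(corp, removePunctuation)                   # remove punctuation marks
corp <- tm_map(corp, removeNumbers)                       # numbers are not needed here
corp <- tm_map(corp, removeWords, stopwords("spanish"))   # empty words (prepositions, articles)
corp <- tm_map(corp, removeWords, c("practica", "fase"))  # repeated words with little meaning (example)
corp <- tm_map(corp, stripWhitespace)                     # remaining blank spaces

# Draw the word cloud of the most frequent remaining terms
tdm  <- TermDocumentMatrix(corp)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 50, random.order = FALSE)
</pre>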
In this field of text mining, we usually start with a set of records that, in the R language, are called highly heterogeneous input documents. For this reason, the first step is to import them into a computational environment; this is basically what R is used for, since its tm package is a standard for statistical analysis of text in R. Once the files have been imported (in this case, converted to .txt), a first cleaning of the imported data is carried out to avoid noise during the analysis: punctuation marks, integers and decimals, and blank spaces are removed and, as a cleaning standard, the texts are converted to lowercase. With this first cleaning done, different structures are needed to evaluate these texts. The first structure is the data.frame, a two-dimensional, matrix-like data frame (table) in which each column holds the values of one variable and each row holds one set of values across the columns.

Table 4: Structure of the data frame (N°, id, text)
- 1: ucsm.txt (syllabus content 1)
- 2: alas.txt (syllabus content 2)

The second structure is similar to that of a database; it contains the documents and helps to manipulate them in a general way, as shown in Table 5. This structure is called a corpus [10] and is found in both the tm and quanteda libraries. For our project, we use both, since in that way we can properly clean the imported data and also handle it more easily.

Table 5: Corpus structure (library, structure)
- tm: id - text (second cleaning: to_Lower)
- quanteda: Corpus consisting of 2 documents and 2 docvars

Once the first part of the general structure of the documents has been built, a set of documents with the characteristic of being high-dimensional (that is, large) is obtained. Since we already have the corpus, the next step is its processing, which means applying the methods that finish cleaning and structuring the input, thereby identifying the characteristics in a simplified set that can be analyzed later. Among the operations applied to the text are:
ID1 => tm_map(corpusCompleto, stripWhitespace)
ID2 => tm_map(corpusCompleto, removeWords, stopwords("spanish"))
where the white spaces are eliminated (ID1) and, as an important part, the empty words that in the Spanish language correspond to prepositions are removed (ID2). From what has already been captured and obtained (the corpus), one of the essential points when going deeper into text mining is the creation of the document-term matrix, whose rows correspond to the imported documents and whose columns correspond to their terms, as shown in Figure 7.

Figure 7: Document-term matrix

3.4. Phase 4: Transition

This is the final stage of RUP, where the product obtained after all the iterations is placed in the hands of the end users to verify that it meets all the requirements and satisfies the users correctly. The main deliverables are the operational prototype of the application and all the necessary documentation. The WebApp software was built in the R language using the Shiny library, which allows the creation of web pages using HTML, CSS, and JavaScript. Figure 8 shows the main window of the built WebApp.

Figure 8: WebApp main window

Figure 9 shows the window of available universities and agreements.

Figure 9: Window of available universities and agreements.

4. Results

This section presents the evaluation results of the built application. The distance between the syllabus documents was computed to measure their similarity according to Equation 1:

similarity(doc1, doc2) = cos(θ) = (doc1 · doc2) / (|doc1| |doc2|)   (1)

According to the Euclidean distance, the similarity obtained between text1 and text2 is 30.09983. The word frequencies obtained in the tests are shown in Figure 10.

Figure 10: Word frequency

Another aspect of interest when analyzing these texts is seeing how specific keywords relate to others. To do this, we examine which words appear next to the words "intelligence," "learning," "diffuse," and "deep." Figure 11 shows the relationships between the words with similarity percentages.

Figure 11: Relationships between words with similarity percentage
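As an illustration of how the document-term matrix and the cosine similarity of Equation 1 can be computed for two syllabi with the tm package, the following sketch uses the file names of Table 4 and mirrors the cleaning steps described above; it is a simplified example under those assumptions, not the exact implementation of the application.

<pre>
# Sketch: cosine similarity (Equation 1) between two syllabi from their document-term matrix.
# File names follow Table 4; the cleaning mirrors the steps of Section 3.3.1.
library(tm)

docs <- VCorpus(VectorSource(c(
  ucsm = paste(readLines("ucsm.txt", encoding = "UTF-8"), collapse = " "),
  alas = paste(readLines("alas.txt", encoding = "UTF-8"), collapse = " ")
)))

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removeWords, stopwords("spanish"))
docs <- tm_map(docs, stripWhitespace)

dtm <- as.matrix(DocumentTermMatrix(docs))   # rows = documents, columns = terms

v1 <- dtm[1, ]; v2 <- dtm[2, ]
cosine_sim <- sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
cat("Cosine similarity:", round(cosine_sim * 100, 2), "%\n")

# The Euclidean distance between the two term vectors can be obtained analogously:
euclid <- sqrt(sum((v1 - v2)^2))
cat("Euclidean distance:", euclid, "\n")
</pre>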
5. Conclusions

Text mining is beneficial for any organization, as it can save money and help solve problems, supporting proper decision-making. In the process, the elimination of specific repeated words is an essential aspect, since it selects the words with the most significant meaning, as obtained from the frequency tables. The relationship of keywords with other words is another essential aspect, since it reveals the similarity of the keywords that appear and thus forms a basis for comparison through the distances used.

The developed WebApp allowed better, personalized interaction with users through its interfaces. It is a very friendly and practical application for the student, since it can be used both on the web and as an application. Likewise, it can be concluded that this application would help institutions be more efficient in the student exchange process, since it offers a more practical way to apply to the universities with which a university is affiliated. On the other hand, the most important aspect is the agreements: they make it easier for students to verify and validate whether it is convenient to carry out an exchange at the chosen University.

6. References

1. ¿Por qué participar en un programa de intercambio? - Portal de Internacionalización | PUCP, https://internacionalizacion.pucp.edu.pe/intercambio-estudiantil-pucp/por-que-participar-en-un-programa-de-intercambio/, last accessed 2022/10/13.
2. Meza-Luque, A., Del Carpio, A.F., Paredes, K.R., Sulla-Torres, J.: Architectural proposal for a syllabus management system using the ISO/IEC/IEEE 42010. Int. J. Adv. Comput. Sci. Appl. (2020). https://doi.org/10.14569/IJACSA.2020.0110640.
3. Orellana, G., Orellana, M., Saquicela, V., Baculima, F., Piedra, N.: A text mining methodology to discover syllabi similarities among higher education institutions. In: Proceedings - 3rd International Conference on Information Systems and Computer Science, INCISCOS 2018 (2018). https://doi.org/10.1109/INCISCOS.2018.00045.
4. Ferreira-Mello, R., André, M., Pinheiro, A., Costa, E., Romero, C.: Text mining in education (2019). https://doi.org/10.1002/widm.1332.
5. Feldman, R., Dagan, I., Hirsh, H.: Mining text using keyword distributions (1998). https://doi.org/10.1023/A:1008623632443.
6. Kadampur, M.A., Riyaee, S. Al: QPSetter: An Artificial Intelligence-Based Web Enabled, Personalized Service Application for Educators. In: Lecture Notes in Networks and Systems (2022). https://doi.org/10.1007/978-3-030-82193-7_51.
7. Ito, T., Tanaka, M.S., Shin, M., Miyazaki, K.: The online PBL (project-based learning) education system using AI (artificial intelligence). In: Proceedings of the 23rd International Conference on Engineering and Product Design Education, E and PDE 2021 (2021). https://doi.org/10.35199/epde.2021.19.
8. Ito, T., Ishii, K., Nishi, M., Shin, M., Miyazaki, K.: Comparison of the effects of the integrated learning environments between the social science and the mathematics. In: SEFI 47th Annual Conference: Varietas Delectat... Complexity is the New Normality, Proceedings (2020).
9. De Aires Angelino, F.J., Loureiro, S.M.C., Bilro, R.G.: Analysing students' engagement in higher education through transmedia and learning management systems: A text mining approach. Int. J. Innov. Learn. 30 (2021). https://doi.org/10.1504/IJIL.2021.118875.
10. Kaibassova, D., La, L., Smagulova, A., Lisitsyna, L., Shikov, A., Nurtay, M.: Methods and algorithms of analyzing syllabuses for educational programs forming intellectual System. J. Theor. Appl. Inf. Technol. 98 (2020).
11. Kawintiranon, K., Vateekul, P., Suchato, A., Punyabukkana, P.: Understanding knowledge areas in curriculum through text mining from course materials. In: Proceedings of 2016 IEEE International Conference on Teaching, Assessment and Learning for Engineering, TALE 2016 (2017). https://doi.org/10.1109/TALE.2016.7851788.
12. Föll, P., Thiesse, F.: Exploring Information Systems Curricula: A Text Mining Approach. Bus. Inf. Syst. Eng. 63 (2021). https://doi.org/10.1007/s12599-021-00702-2.
13. West, J.: Validating curriculum development using text mining. Curric. J. 28 (2017). https://doi.org/10.1080/09585176.2016.1261719.
14. Yasukawa, M., Yokouchi, H., Yamazaki, K.: Syllabus Mining for Faculty Development in Science and Engineering Courses. In: Proceedings - 2019 8th International Congress on Advanced Applied Informatics, IIAI-AAI 2019 (2019). https://doi.org/10.1109/IIAI-AAI.2019.00074.
15. Khan, I.A., Choi, J.T.: Lexicon-corpus Based Korean Unknown Foreign Word Extraction and Updating Using Syllable Identification. In: Procedia Engineering (2016). https://doi.org/10.1016/j.proeng.2016.07.445.
16. Jones, C.: Rational Unified Process (RUP). In: Software Methodologies: A Quantitative Guide (2017). https://doi.org/10.1201/9781315314488-51.
17. Castellano, M., Mastronardi, G., Aprile, A.: A web text mining flexible architecture. World Acad. 1 (2007).
18. Nisa, R., Qamar, U.: A text mining based approach for web service classification. Inf. Syst. E-bus. Manag. 13 (2015). https://doi.org/10.1007/s10257-014-0252-5.
19. Martin, G.: R Studio. In: An Introduction to Programming with R (2021). https://doi.org/10.1007/978-3-030-69664-1_1.
20. Bao, C., Wang, Y.: A Survey of Word Cloud Visualization (2021). https://doi.org/10.3724/SP.J.1089.2021.18811.