=Paper= {{Paper |id=Vol-2555/paper13 |storemode=property |title=Search model of educational trends based on Data Mining techniques |pdfUrl=https://ceur-ws.org/Vol-2555/paper13.pdf |volume=Vol-2555 |authors=Rosario Huanca-Gonza,Julio Vera-Sancho,Carlos Eduardo Arbieto-Batallanos,María del Carmen Córdova-Martínez }} ==Search model of educational trends based on Data Mining techniques== https://ceur-ws.org/Vol-2555/paper13.pdf
      Search model of educational trends based on
               Data Mining techniques

                 Rosario Huanca-Gonza1[0000−0002−1437−5829] , Julio
                 Vera-Sancho2[0000−0001−5526−5223] , Carlos Eduardo
           Arbieto-Batallanos3[0000−0002−7094−4272] , and Marı́a del Carmen
                      Córdova-Martı́nez4[0000−0002−5186−6598]

                    Universidad Nacional de San Agustı́n de Arequipa
                  {rhuancag,jveras,carbieto,mcordovam}@unsa.edu.pe



          Abstract. The Internet is the broadest means of communication that
          has ever existed and a highly effective channel for disseminating in-
          formation, giving access to millions of pages of textual and multimedia
          content. This abundance leads to information overload, a problem known
          as infoxication, and researchers and teachers searching for information
          on educational trends are no exception. For this reason, we propose a
          model for searching educational trends using Data Mining techniques,
          which allows us to capture, analyze, disseminate and exploit the main
          topics currently being developed around educational trends.

          Keywords: Data mining · educational trends · Machine learning


  1     Introduction
At present, we live in an era where information is easily accessible, and the
amount of information on the web keeps increasing. According to a report by
IDC (International Data Corporation), only 33% of this information would be
valuable if analyzed, and the digital universe was projected to keep growing
enormously through 2020 [8]. As part of this flood of information, infoxication
appears: the excess of information that creates confusion among ICT users. It
is also known as info-saturation, in reference to the cognitive effects produced
by access to large amounts of information that the individual fails to assimilate
[14]. Against this background, researchers and educators need to search for
educational trends that can improve the teaching and learning process for both
teachers and students. There is a large number of repositories specialized in
research on education, such as ERIC, a bibliographic database of international
coverage in the field of education. It includes indexes and summaries of journal
articles and reports, known as the documents of the Education Resources
Information Center (ERIC), from 1966 to the present; it is updated monthly
and holds more than 1,341,146 records [15]. In recent years, the search for
educational trends in research has been carried out manually, with the ability to filter
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).

information related to the search, assessment and synthesis of information, by
which an individual in an environment of abundant information is able to select
information critically and give it meaning [14]. With the advances in artificial
intelligence and data processing, there are techniques that allow this entire
process to be performed automatically. The massive information that exists is
known as Big Data, and it must be processed in order to generate value; Data
Mining algorithms allow us to solve these kinds of problems. The types of
learning are supervised, unsupervised and semi-supervised [5]. Supervised
learning takes a known set of input data together with known, labeled responses,
and builds an algorithm that generates predictions for new data; this type of
learning uses classification or regression algorithms. Unsupervised learning,
unlike supervised learning, has no labeled data; its objective is to find
regularities in the input so that certain patterns can be discovered. The phases
used in Data Mining are data filtering, variable selection, knowledge extraction,
and interpretation and evaluation [17], which in our proposal will help us
discover knowledge about educational trends in research.


2   State of the art

Slamet (2018), in the research “Web Scraping and Naïve Bayes Classification for
Job Search Engine”, observes that many organizations use websites to share
information about job openings and that this information overflows across
thousands of sites with different attributes and criteria. This abundance makes
the selection process very complex and leads to inefficient execution times, so
a simple method is proposed to simplify the job search by building web scraping
techniques and Naive Bayes classification into the search engine. In 2016,
Meschenmoser, in the research “Scraping Scientific Web Repositories: Challenges
and Solutions for Automated Content Extraction”, proposes strategies to
programmatically access data in scientific web repositories. The strategies are
demonstrated as part of an open source tool (MIT license) that allows
comparisons of research performance based on Google Scholar data, emphasizing
that the scraping included in the tool should
only be used if the operator of a repository gives its consent [11]. Klochikhin
(2016), in the project “Collaborative innovation beyond science: exploring new
data and new methods with computer science”, mentions that bibliometrics and
patent analysis have laid an important basis for a better understanding of the
dynamics of innovation; new computational methods and tools can take this
analysis one step further while providing additional information on the
mechanisms of collaborative innovation. Web scraping, record-linkage algorithms
and computational linguistics provide a wide range of approaches to facilitate,
enrich and replace traditional data sources and analytical tools. He proposes
using new techniques to study the mechanisms of scientific collaboration and
the composition of research teams; to analyze innovation networks; to follow
knowledge and ideas as they circulate among scientists, engineers and
entrepreneurs; and to study the intricate nature of university-industry links
and collaborative innovation [9]. Moscoso
(2016) brings together a wide range of techniques and algorithms that allow the
extraction of knowledge from databases for decision making using data mining,
techniques that have been applied to many fields of study, here focusing on an
important one: education. The application of data mining in education is known
as educational data mining (EDM). The main objective of EDM is to analyze data
from educational institutions using techniques such as prediction, clustering,
time-series analysis and classification, among others. The paper presents a
holistic view of EDM that includes a classification of the algorithms, methods
and tools used in data mining processes; in addition, the processes and
indicators that could be improved in educational institutions are analyzed.
The study covers papers published from 2005 to 2015 [12].


3     Theoretical Background

3.1   Web Scraping

Web Scraping is the practice of gathering data through any means other than a
program interacting with an API. It is generally achieved by writing an
automated program that queries a web server, requests data, and parses that
data to extract the necessary information [3]. For data extraction there is a
set of libraries that help in this process, among them Jsoup and Scrapy. Scrapy
is an open source library developed in Python that generates a structured
project and is optimized for scraping tasks. It can be used for a wide range of
purposes, from data mining to automated monitoring and testing [16].
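As a minimal illustration of the parsing step, the following sketch extracts titles from a results page with Python's standard html.parser; the `class="title"` selector and the HTML snippet are hypothetical, not the markup of any real repository:

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Collects the text of every <a class="title"> link on a results page.
    The 'title' class name is a hypothetical selector for illustration."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opened tag
        if tag == "a" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = '<div><a class="title">Trends in e-learning</a><a href="#">next</a></div>'
parser = TitleExtractor()
parser.feed(html)
```

A production crawler would instead fetch live pages and follow pagination, which is what a Scrapy project structures for you.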


3.2   Data Mining

Data mining is a discipline that has emerged at the confluence of several
others, driven primarily by the growth of large databases. The basic motivating
stimulus behind data mining is the search for surprising, novel, unexpected or
valuable information, and the goal is to extract that information. This means
that the subject is closely related to exploratory data analysis. However, the
problems arising from the size of databases, as well as ideas and tools
imported from other areas, mean that data mining is more than just exploratory
data analysis [4].


Data filtering Given the set of collected data and the objectives we want to
achieve, we proceed to choose the available data relevant to the study and
integrate them into a single set that favors reaching the objectives of the
analysis. This information may be found in a single (centralized) source or may
be distributed [5].

Variable selection The selection of variables is a very important step: even
after preprocessing, in most cases a large amount of data remains. Feature
selection reduces the size of the data to a k-dimensional vector by choosing
the most influential variables, without sacrificing the quality of the
knowledge model obtained from the mining process [5]. The methods for variable
selection are [5]:

    – Those based on the choice of the best attributes of the problem.
    – Those looking for independent variables through sensitivity tests, distance
      or heuristic algorithms.


Knowledge Extraction Algorithms Knowledge discovery in databases (KDD) is “the
non-trivial process of identifying valid, novel, potentially useful and,
ultimately, understandable patterns from data”. Data mining constitutes only
one stage of this process, whose objective is to obtain patterns and models by
applying statistical methods and machine learning techniques. Finally, the
process of knowledge extraction also involves the evaluation and interpretation
of the patterns or models obtained in the data mining stage [10]. Within the
knowledge extraction algorithms of Machine Learning, we have the following
types of learning:

    – Supervised Learning: a learning model built to make predictions, where
      for a given set of input data the output responses are known [5].
    – Unsupervised Learning: unlike supervised learning, unsupervised learning
      finds patterns that exist in the input data without any information on
      the category of the input data [5].
    – Semi-Supervised Learning: the combination of supervised and unsupervised
      learning. The objective of semi-supervised learning is to classify the
      unlabeled data using the set of labeled information.
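The difference between the first two settings can be illustrated with a toy sketch in Python: with labels available we can fit one centroid per class and predict; without them, only the structure of the data itself could be exploited (e.g. by clustering). The 1-D points and class names below are invented for illustration:

```python
# Supervised learning: labels are known, so we learn a centroid per class.
def fit_centroids(points, labels):
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + p
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, p):
    # assign a new point to the class with the nearest centroid
    return min(centroids, key=lambda y: abs(centroids[y] - p))

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]   # toy 1-D feature values
labels = ["short", "short", "short", "long", "long", "long"]
model = fit_centroids(points, labels)
# Unsupervised learning: with the labels hidden, the two groups could only
# be discovered from the data itself (e.g. with k-means).
```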


Interpretation and evaluation In this phase of Data Mining we verify whether
the results are consistent. Once the learning model is obtained, it must be
validated, checking that the conclusions it produces are valid and sufficiently
satisfactory. If several models are obtained by using different techniques, the
models should be compared in search of the one that best fits the problem [1].


3.3     Dimensionality Reduction

Dimensionality reduction refers to the process of mapping an n-dimensional
point into a lower, k-dimensional space. This operation reduces the cost of
representing and storing an object or a data set in general [5]. Dimensionality
reduction is divided into two categories, feature selection and feature
extraction: the first chooses a subset of features according to some criterion,
and the second transforms high-dimensional data into low-dimensional data. The
reduction is very important because, with a large amount of data and long text
strings, the feature vectors can reach dimensionalities that slow down
processing.

Ant Colony Optimization Algorithm In 1992, Marco Dorigo proposed in his PhD
thesis an algorithm based on the behavior of ants searching for food, with its
first application being the traveling salesman problem [2]. Ants in the real
world wander randomly in search of food; they are almost blind, so they
communicate with each other through pheromones. While wandering from their nest
to a food source, they leave a pheromone trail, which they reinforce on the way
back to the nest. Since other ants follow the trails around them, the paths
most traveled by ants that have found food accumulate the most pheromone.

3.4   Support Vector Machine
An SVM (Support Vector Machine) is a discriminative classifier formally defined
by a separating hyperplane. In other words, given labeled training data
(supervised learning), the algorithm computes an optimal hyperplane that
categorizes new examples. In a two-dimensional space, this hyperplane is a line
that divides the plane into two parts, with each class on one side [13].
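As an illustrative sketch (not the solver used by real SVM libraries such as WEKA's), a linear SVM can be trained by subgradient descent on the hinge loss; the toy data, learning rate and epoch count below are arbitrary assumptions:

```python
import random

def train_linear_svm(xs, ys, lam=0.01, epochs=200, lr=0.1):
    """Toy 2-D linear SVM trained with subgradient descent on the hinge loss
    (a Pegasos-style sketch); labels must be +1 / -1."""
    w = [0.0, 0.0]
    b = 0.0
    random.seed(0)
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1:   # point violates the margin: push the hyperplane
                w = [w[i] + lr * (y * x[i] - lam * w[i]) for i in range(2)]
                b += lr * y
            else:            # point is safe: only apply regularization
                w = [w[i] - lr * lam * w[i] for i in range(2)]
    return w, b

def classify(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1

# Two linearly separable clusters; the hyperplane is a line between them.
xs = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4), (3.0, 3.0), (2.5, 3.2), (3.1, 2.8)]
ys = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(xs, ys)
```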

3.5   Visualization
Data visualization is the presentation of data in illustrated or graphic
formats, allowing people to see analytics presented visually so that they can
grasp complicated concepts or identify new patterns. With interactive
visualization, the concept can be taken one step further, using technology to
drill into diagrams and graphs for more detail, interactively changing what
data is shown and how it is processed [7].
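For the tag-cloud visualization used later in this work, the only data preparation needed is a frequency count of the collected labels; the tags and the font-size mapping below are illustrative assumptions:

```python
from collections import Counter

# A word cloud scales each label by its frequency, so counting the
# collected labels is the essential preparation step.
labels = ["e-learning", "gamification", "e-learning", "mooc",
          "e-learning", "gamification"]
freq = Counter(labels)

# A hypothetical renderer would map counts to font sizes, e.g. linearly:
sizes = {tag: 10 + 5 * n for tag, n in freq.items()}
```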

4     Proposal
This section describes the proposal to search for educational trends in research
based on Data Mining techniques; the whole procedure is shown in Fig. 1.

4.1   Data collection
For data collection, the ERIC database has been selected, which provides us
with scientific articles related to the area of Education. This first stage is
divided into three parts:
 – Web Crawling: the ERIC website [6] is inspected.
 – Web Scraping: the information is extracted according to the structure of
   the website.
 – Save information: the extracted information is stored in the database with
   the following fields: “title”, “category”, “year”, “authors”, “urlsource”,
   “description”.
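The storage step can be sketched with Python's built-in sqlite3 module using the six fields listed above; the schema, the sample record and its URL are assumptions for illustration, not the project's actual database:

```python
import sqlite3

# In-memory database with the six fields listed above; a real deployment
# would use a persistent file or another storage engine.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE articles (
    title TEXT, category TEXT, year INTEGER,
    authors TEXT, urlsource TEXT, description TEXT)""")

# Hypothetical scraped record (title, URL and abstract are invented).
record = ("Trends in e-learning", "e-learning", 2019,
          "Doe, J.", "https://example.org/article1", "A sample abstract.")
conn.execute("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)", record)

rows = conn.execute(
    "SELECT title, year FROM articles WHERE category = ?", ("e-learning",)
).fetchall()
```

Parameterized queries (the `?` placeholders) keep scraped text from breaking or injecting into the SQL.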




                              Fig. 1. Proposed Architecture


4.2      Pre-processing Text
To perform the pre-processing of the text, the following methods are applied:
    – Tokenization: the process of segmenting text into words, called tokens,
      while discarding punctuation marks.
    – Stop-words: common words of a language, such as “the”, “a”, “is”, etc.
      These words are irrelevant in text processing.
    – Stemming: this process reduces words to their root form.
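The three steps can be sketched in Python; the stop-word list and the crude suffix-stripping stemmer below are illustrative stand-ins for a real stop-word corpus and a Porter-style stemmer:

```python
import re

STOP_WORDS = {"the", "a", "is", "of", "and", "in"}   # tiny illustrative list

def suffix_stem(word):
    """Crude suffix stripping, for illustration only; a real pipeline
    would use a Porter-style stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # tokenize on letter runs, which also discards punctuation marks
    tokens = re.findall(r"[a-z]+", text.lower())
    return [suffix_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Learning trends in the web, and web-based teaching.")
```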
   After performing these steps, we apply the Term Frequency - Inverse
Document Frequency (TF-IDF) algorithm, which tells us how relevant each word is
in a document, where “t” is the term, “d” the document and “D” the set of
documents. Multiplying these two values gives a score: the higher the score,
the more relevant the word is in the document.

          tf·idf(t, d, D) = log(1 + freq(t, d)) · log( N / count(d ∈ D : t ∈ d) )        (1)

    – Term Frequency: the frequency of a term, denoted tf(t, d), measures how
      frequent a term “t” is in the document “d”.
    – Inverse Document Frequency: indicates how common a word is in the whole
      set of documents. It is calculated by taking the total number of
      documents (“N”) and dividing it by the number of documents that contain
      the word.

    The pre-processing output will be the input of the ant colony algorithm;
this input has the following form:
    [(’computer’, 0.0651), (’learn’, 0.1789), (’web’, 0.0601)]
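A direct transcription of Eq. (1) produces (word, score) pairs of the form shown above; the toy corpus is invented, and the logarithm base (natural log here) is an assumption, since the base is not stated:

```python
import math

def tf_idf(term, doc, docs):
    """Score of `term` in `doc` following Eq. (1): log-scaled term frequency
    times inverse document frequency (natural log assumed)."""
    freq = doc.count(term)
    df = sum(1 for d in docs if term in d)       # count(d in D : t in d)
    return math.log(1 + freq) * math.log(len(docs) / df)

docs = [["computer", "learn"], ["learn", "web"], ["learn"]]
# "learn" appears in every document, so its idf (and score) is zero;
# "computer" is rarer and therefore scores higher.
scores = [(t, round(tf_idf(t, docs[0], docs), 4)) for t in docs[0]]
```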

Unsupervised Feature Selection For feature selection, the bio-inspired ant
colony algorithm is applied. Before carrying out the selection process, we
create an undirected graph, denoted G = (F, E), where F is the set of features
and E the set of edges; to weight the edges, the cosine similarity between
features is used (2).


      S_{A,B} = | ( Σ_{i=1}^{p} a_i b_i ) / ( √(Σ_{i=1}^{p} a_i²) · √(Σ_{i=1}^{p} b_i²) ) |        (2)

      1 / S_{A,B}        (3)
    Where A and B are two features of dimensionality “p”. According to the
equation, the similarity value ranges between 0 and 1: if the features are
similar, 1 is obtained, otherwise 0. Once the graph is built, the Ant Colony
Optimization algorithm is applied. This algorithm has two important components,
the first being its “heuristic information” and the second its “desirability”.
The heuristic information is defined as the inverse of the similarity between
features, that is (3), and the desirability is the amount of pheromone, denoted
τ.
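Equations (2) and (3) translate directly into code; the feature vectors below are toy examples:

```python
import math

def cosine_similarity(a, b):
    """Eq. (2): absolute cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return abs(dot / norm)

def heuristic(a, b):
    """Eq. (3): heuristic information as the inverse of the similarity
    (undefined for fully dissimilar vectors, where similarity is zero)."""
    return 1.0 / cosine_similarity(a, b)

# Parallel vectors are maximally similar; orthogonal ones are not similar.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # 1.0
orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # 0.0
```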


4.3   Learning model

Once the dimensionality of the feature vector has been reduced, it serves as
input to our clustering algorithm, in our case K-Means, in order to find common
characteristics that can be grouped, visualized and interpreted. As part of the
verification of the results obtained, we applied the SVM supervised learning
algorithm, with which we verify that the labels generated for each document are
coherent and correctly classified.
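A minimal 1-D K-Means sketch of the clustering step (a real run would use a library implementation on the full TF-IDF feature vectors; the points below are invented):

```python
import random

def k_means(points, k, iterations=20):
    """Minimal 1-D k-means: alternate assignment to the nearest centroid
    and centroid recomputation."""
    random.seed(1)
    centroids = random.sample(points, k)          # initialize from the data
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [0.9, 1.1, 1.0, 7.9, 8.1, 8.0]   # two obvious groups
centroids, clusters = k_means(points, 2)
```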


5     Results

This section describes the experiments performed applying the K-Means
clustering algorithm, to see the grouping of scientific articles related to
educational trends, and the application of the SVM algorithm to validate the
learning model.


Database The database used in this work is a compendium of ERIC - Education
[6], for which a taxonomy in education has been proposed based on educational
trends in the year 2019 [18]. To create our database, we defined a search pivot
starting from 20 kinds of educational trends, which allowed us to obtain a
large number of scientific articles related to education; this was done because
ERIC does not allow blank searches.
Parameter Settings: The proposed parameters are a maximum number of cycles
numCi = 10; a number of ants equal to the number of threads, numHor = numHeb;
an initial amount of pheromone for each feature τi = 0.2; an evaporation
coefficient δ = 0.2; and the parameter qini = 0.7, which controls the balance
between exploration and exploitation, while the value of β indicates the
importance of the pheromone. According to the database collected, a maximum of
50 features will be available.
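How qini trades exploration against exploitation can be sketched with the pseudo-random-proportional rule of Ant Colony System; note that this rule, and the pheromone/heuristic values below, are assumptions for illustration, since the selection formula is not spelled out here:

```python
import random

def choose_feature(candidates, tau, eta, q0=0.7, beta=1.0):
    """Pseudo-random-proportional rule (Ant Colony System, assumed here):
    with probability q0 the ant exploits the best-scored feature; otherwise
    it explores, choosing proportionally to tau * eta**beta."""
    scores = {f: tau[f] * eta[f] ** beta for f in candidates}
    if random.random() < q0:
        return max(scores, key=scores.get)        # exploitation
    total = sum(scores.values())                  # biased exploration
    r = random.uniform(0, total)
    acc = 0.0
    for f, s in scores.items():
        acc += s
        if acc >= r:
            return f
    return f                                      # numeric-edge fallback

tau = {"web": 0.2, "learn": 0.2, "computer": 0.2}   # initial pheromone (τi)
eta = {"web": 2.0, "learn": 5.0, "computer": 1.0}   # heuristic information
random.seed(0)
picks = [choose_feature(list(tau), tau, eta) for _ in range(100)]
```

With q0 = 0.7, most draws exploit the best feature, but the remaining 30% keep the other features reachable.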

Results: The results were obtained by applying the parallelized PUFSACO
algorithm (Parallel Unsupervised Feature Selection based on Ant Colony
Optimization). The tests were performed on an HP computer with an Intel Core i7
processor; the methods were written in C++ and run on Ubuntu 18.04.2 LTS. In
the experiments, the ROC curve is used to measure performance. In this work,
one third of the database was used for the test stage (41,160 records). The
WEKA software is used to classify the text, applying the SVM (Support Vector
Machine) algorithm with a polynomial kernel (PolyKernel); this algorithm is
used only for validation.




               Fig. 2. Classification applying the SVM algorithm in Weka



    The ROC curve shows the trade-off between sensitivity (the true positive
rate) and specificity (1 − the false positive rate). Classifiers whose curves
are closer to the upper left corner perform better; in the tests carried out, a
ROC area of 0.9 was obtained. As part of the results, we use information
visualization to show, in a fast and concise way, the quantity of labels
collected for each topic. The visualization technique is known as a word cloud
or tag cloud, and it can be reviewed at the following link:
http://tendenciaseducativas.rf.gd/. Below is the graphic representation of the
educational taxonomy, with the tags and their categories shown as tag clouds;
likewise, the tags related to education are displayed.





          Fig. 3. a)Behavior of the labels. b)Behavior of educational categories

   Finally, Fig. 4 shows the visualization of the information by year of
publication and quantity, linked to the category each article belongs to.




Fig. 4. Behavior of the information relating quantity and time (own elaboration)




6   Conclusions

The results of this investigation, whose first stage of data collection used
Web Crawling to inspect websites and Web Scraping to extract the information,
can be visualized in a graphic repository of educational trends of the word
cloud (tag cloud) type, which allows us to better understand how the trends
were grouped by categories and how the number of searches relates to the search
year; it can be reviewed at the following link:
http://tendenciaseducativas.rf.gd/. An important part of the research is also
aimed at dimensionality reduction using bio-inspired algorithms, which helps
reduce the large amount of information, leaving only the relevant part. The
application of unsupervised learning shows that it can help us discover
information that does not stand out to the naked eye; when processed by this
type of algorithm, more relevant information becomes apparent, and it can be
applied to decision making.


7   Acknowledgment

This research work was carried out within the framework of the research project
IBA-0029-2016, “Technological Surveillance Services for Research Centers and
Technological Innovation Classroom, Oriented to the Development of R+D+I
Projects in ICTs and Education”. We express our deepest gratitude to the
Universidad Nacional de San Agustín for making this study possible.

References
 1. Aparna U.R., Paul, S.: Feature selection and extraction in data mining. In: 2016
    Online International Conference on Green Engineering and Technologies (IC-
    GET). pp. 1–3 (Nov 2016). https://doi.org/10.1109/GET.2016.7916845
 2. Colorni, A., Dorigo, M., Maniezzo, V.: An investigation of some properties of an
    ant algorithm. In: Proc. Parallel Problem Solving from Nature Conference. pp.
    509–520 (1992)
 3. Haddaway, N.R.: The use of web-scraping software in searching for grey literature.
    Grey J 11(3), 186–90 (2015)
 4. Hand, D.J.: Data Mining Based in part on the article “Data mining” by
    David Hand, which appeared in the Encyclopedia of Environmetrics. Ameri-
    can Cancer Society (2013). https://doi.org/10.1002/9780470057339.vad002.pub2,
    https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470057339.vad002.pub2
 5. Herrera, F., Charte, F., Rivera, A.J., Del Jesus, M.J.: Multilabel classification. In:
    Multilabel Classification, pp. 17–31. Springer (2016)
 6. Institute of Education Sciences: Eric, https://eric.ed.gov/
 7. Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern recognition letters
    31(8), 651–666 (2010)
 8. KDnuggets: IDC study: Digital universe in 2020,
    https://www.kdnuggets.com/2012/12/idc-digital-universe-2020.html
 9. Klochikhin, E.: Collaborative innovation beyond science: Exploring new data and
    new methods with computer science (2016)
10. Mariñelarena-Dondena, L., Errecalde, M.L., Solano, A.C.: Extracción de
    conocimiento con técnicas de minerı́a de textos aplicadas a la psicologı́a. Revista
    Argentina de Ciencias del Comportamiento 9(2), 65–76 (2017)
11. Meschenmoser, P., Meuschke, N., Hotz, M., Gipp, B.: Scraping scientific web repos-
    itories: Challenges and solutions for automated content extraction. D-Lib Magazine
    22(9/10) (2016)
12. Moscoso-Zea, O., Luján-Mora, S.: Educational data mining: An holistic view. In:
    2016 11th Iberian Conference on Information Systems and Technologies (CISTI).
    pp. 1–6. IEEE (2016)
13. Ramli, M.A., Twaha, S., Al-Turki, Y.A.: Investigating the performance of support
    vector machine and artificial neural networks in predicting solar radiation on a
    tilted surface: Saudi arabia case study. Energy conversion and management 105,
    442–452 (2015)
14. Santos, A.R.P., Carreño, J.D., Pinto, Y.A.S.: Infoxicación y capacidad de filtrado:
    Desafı́os en el desarrollo de competencias digitales. Etic@ net 18(1), 102–117 (2018)
15. Schindler, L., Puls-Elvidge, S., Welzant, H., Crawford, L.: Definitions of quality in
    higher education: A synthesis of the literature. Higher Learning Research Commu-
    nications 5(3), 3–13 (2015)
16. Scrapy: Scrapy, https://scrapy.org/
17. Srivastava, M., Garg, R., Mishra, P.K.: Analysis of data extraction and
    data cleaning in web usage mining. In: Proceedings of the 2015 Inter-
    national Conference on Advanced Research in Computer Science Engineer-
    ing & Technology (ICARCSET 2015). pp. 13:1–13:6. ICARCSET ’15,
    ACM, New York, NY, USA (2015). https://doi.org/10.1145/2743065.2743078,
    http://doi.acm.org/10.1145/2743065.2743078
18. teachthought:      30     of    the   most     popular     trends    in    education,
    https://www.teachthought.com/the-future-of-learning/most-popular-trends-
    in-education/