<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>2. BIBLIOMETRIC TECHNIQUES
Bibliometrics is the use of statistical methods for the analysis of
journal articles</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Differentiating between Educational Data Mining and Learning Analytics: A Bibliometric Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stevens Dormezil</string-name>
          <email>stevens.dormezil@palmbeachschools.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taghi Khoshgoftaar</string-name>
          <email>khoshgof@fau.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Robinson-Bryant</string-name>
          <email>robinsof@erau.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Research and Evaluation</institution>
          ,
          <addr-line>3300 Forest Hill Boulevard, West Palm Beach, FL 33406</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Embry-Riddle Aeronautical University</institution>
          ,
          <addr-line>600 South Clyde Morris Boulevard, Daytona Beach, FL 32114-3900</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Florida Atlantic University</institution>
          ,
          <addr-line>777 Glades Road, Boca Raton, FL 33431</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Educational Data Mining and Learning Analytics are two relatively new research fields. Natural language techniques can be used to identify major research themes within each field. Similarities and differences between both domains are identified through the use of keyword analysis. Over 4,000 articles are analyzed and bibliometric techniques are used to select 60 articles that best represent major research themes within the intersection, as well as disjoint elements of both fields. Following keyword analysis, we conclude it is more accurate to describe what appears to be two domains (i.e. Educational Data Mining and Learning Analytics) as one domain (i.e. Learning Analytics) with one prominent subset (i.e. Educational Data Mining).</p>
      </abstract>
      <kwd-group>
        <kwd>educational data mining</kwd>
        <kwd>learning analytics</kwd>
        <kwd>bibliometrics</kwd>
        <kwd>big data</kwd>
        <kwd>machine learning</kwd>
        <kwd>natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. BIBLIOMETRIC TECHNIQUES</title>
      <p>
        one of three ways: inter-citation counts, co-citation counts, or
bibliographic coupling frequencies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Inter-citation counts
represent how often two works have cited each other.
Co-citation counts track the number of documents that cite two works
together. Bibliographic coupling frequencies measure the number
of references that two works cite in common. Together, the
number of times a piece of scientific literature is cited and
its relationship to other cited literature can be
used to gauge that literature’s overall impact within the
scientific community.
      </p>
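      <p>The three citation measures described above can be illustrated with a minimal sketch. The toy citation data and the function names below are hypothetical and serve only to show how each count is obtained; they are not drawn from the datasets analyzed in this work.</p>
      <preformat preformat-type="code">
```python
# Hypothetical toy data: cites[w] is the set of works that work w cites.
cites = {
    "A": {"C", "D"},
    "B": {"C", "D"},
    "C": {"D"},
    "D": set(),
}


def inter_citation(x, y):
    """Number of citations exchanged between x and y, in either direction."""
    return int(y in cites[x]) + int(x in cites[y])


def co_citation(x, y):
    """Number of documents in the collection that cite both x and y."""
    return sum(1 for refs in cites.values() if x in refs and y in refs)


def bibliographic_coupling(x, y):
    """Number of references that x and y have in common."""
    return len(cites[x].intersection(cites[y]))


def times_cited(x):
    """Overall number of times x is cited within the collection."""
    return sum(1 for refs in cites.values() if x in refs)


print(co_citation("C", "D"), bibliographic_coupling("A", "B"), times_cited("D"))
```
      </preformat>
      <p>In this toy collection, works C and D are co-cited by A and B, while A and B are bibliographically coupled through their shared references to C and D.</p>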
      <p>This work, however, focuses on the conceptual structure of the EDM
and LA domains and therefore applies methods based on keyword
co-occurrences within the bibliographic collections. Through
dimensionality reduction techniques such as Multidimensional
Scaling (MDS), Correspondence Analysis (CA), or Multiple
Correspondence Analysis (MCA), interactions within and across
each topic reveal clusters of items that express common concepts
(or thematic areas). This is accomplished by co-word analysis,
in which themes are determined by keywords and converted into
clusters of keywords (or sublists). The definition of ‘keyword’
varies with the needs of the researcher but can be limited to
item titles, explicitly identified author keywords, abstracts, or
combinations of the three, as determined by the database used. This
work adopts the understanding that keywords encompass all
three fields.</p>
      <p>Results from co-word analysis can be plotted on a
two-dimensional map called a thematic map in order to visualize
relationships among clusters. The conceptual structure depicted
on such visualizations can show topics covered by researchers,
relative similarity to other works, relative importance to the field,
and the evolution of topics over a given period. Similar insights
can be drawn from network diagrams.</p>
      <p>
        Both thematic networks and citation graphs are natural visual
representations of the respective networks. Network
multiplication can be used to derive various network types [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
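Thematic networks are depicted by two-mode networks that
represent links between a set of keywords and a set of
corresponding works. Two-mode networks of this type can be
represented with a works-by-keywords matrix
<inline-formula><tex-math>\mathbf{WK}</tex-math></inline-formula>
and can be combined with other networks through
matrix multiplication. Other two-mode
networks can be constructed in a similar manner, such as
<inline-formula><tex-math>\mathbf{WA}</tex-math></inline-formula> (works by authors),
<inline-formula><tex-math>\mathbf{WJ}</tex-math></inline-formula> (works by journals), and
<inline-formula><tex-math>\mathbf{WC}</tex-math></inline-formula> (works by classification).
Additional unique network types can be derived
through matrix multiplication, such as
<inline-formula><tex-math>\mathbf{AJ} = \mathbf{WA}^{T} \cdot \mathbf{WJ}</tex-math></inline-formula>,
which gives
the two-mode network of authors by journals.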
      </p>
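      <p>The derivation of an authors-by-journals network by matrix multiplication can be sketched as follows. The matrices are hypothetical toy data, and the names WA, WJ, and AJ are illustrative labels for the works-by-authors, works-by-journals, and authors-by-journals networks; the sketch assumes simple binary incidence matrices.</p>
      <preformat preformat-type="code">
```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


# Hypothetical toy collection: 3 works, 2 authors, 2 journals.
# WA[w][a] = 1 when work w was written by author a.
WA = [
    [1, 0],
    [1, 1],
    [0, 1],
]
# WJ[w][j] = 1 when work w appeared in journal j.
WJ = [
    [1, 0],
    [0, 1],
    [0, 1],
]

# AJ[a][j] counts the works by author a that appeared in journal j.
AJ = matmul(transpose(WA), WJ)
print(AJ)
```
      </preformat>
      <p>Each entry of the product sums over the shared mode (works), which is why the two input networks must be built over the same set of works.</p>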
    </sec>
    <sec id="sec-2">
      <title>3. METHODOLOGY</title>
      <p>
        There are basic steps that have been generalized for conducting
bibliometric studies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Figure 1 captures the five-step approach
adapted in this study.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3.1 Data Source</title>
      <p>All bibliographic information analyzed in this article was
collected from the largest peer-reviewed citation database,
Scopus. This abstract and citation database draws from three main
sources (scientific journals, books, and conference proceedings)
and integrates a patent database within the platform. Scopus is
maintained by Elsevier and integrates titles from more than 5,000
publishers, with more than 20,000 serial titles, 150,000 books,
and more than 70 million items. It was chosen as the most
comprehensive data collection source available for the purpose of
this research.</p>
    </sec>
    <sec id="sec-4">
      <title>3.2 Data Collection</title>
      <p>During the week of March 31 to April 6, 2019, three searches
were conducted in the Scopus database. One search consisted of
the term “learning analytics.” The retrieved items were considered
members of the LA dataset. A second search consisted of the
term “educational data mining.” The retrieved items were
considered members of the EDM dataset. Finally, a third search
consisted of the two previous search terms combined with the
Boolean operator AND. Items from this final group were
considered members of the joint LA &amp; EDM dataset. Table 1
gives summary statistics for the initial results of each search.</p>
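      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Summary statistics for the initial results of each search.</p>
        </caption>
        <table>
          <tbody>
            <tr><td>LA</td><td>3008</td><td>736</td><td>6899</td><td>4885</td></tr>
            <tr><td>EDM</td><td>1351</td><td>600</td><td>3892</td><td>2453</td></tr>
            <tr><td>LA &amp; EDM</td><td>295</td><td>151</td><td>1128</td><td>695</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>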
All data files were downloaded as BibTeX bibliography files
inclusive of all available Scopus data related to each item.
Reference entries were stored in a style-independent, text-based
file format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] similar to the example entry in Figure 2.
<preformat>@Book{todeschini+baccini,
  author    = "Robert {Todeschini} and Alberto {Baccini}",
  title     = "Handbook of bibliometric indicators :
               quantitative tools for studying and evaluating research",
  publisher = "Wiley-VCH Verlag",
  year      = 2016,
  address   = "Weinheim, Germany",
  edition   = "First"
}</preformat>
In each dataset, entries with missing authors were removed.
These entries generally correspond to conference proceedings
records that summarize the collection of papers presented at the
associated conference. Documents that appeared both in the LA
dataset and the joint LA &amp; EDM dataset, or both in the EDM dataset and the joint LA
&amp; EDM dataset, were also removed to ensure that all three datasets
were mutually exclusive. Documents that were missing keyword
terms were eliminated as well, because two-mode networks generated from
works by keywords were to be used as the primary tool for
generating thematic maps [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Search terms such as ‘learning
analytics’ and ‘educational data mining’ were removed from each
respective dataset to limit the chance of forming dominant
clusters centered on the search terms themselves.
      </p>
      <p>Following the processing of all three datasets, 1,952 documents
remained in the LA dataset (-35%), 783 documents in the EDM
dataset (-40%), and 226 documents in the joint LA &amp; EDM
dataset (-20%). Table 2 gives summary statistics for
each dataset after processing.</p>
      <p>Table 2. LA: 1952, 438, 6436, 3645, 3962, 7068, 122, 3840, 136, 0.403, 2.48, 3.33, 2.6</p>
    </sec>
    <sec id="sec-5">
      <title>3.3 Theme Identification</title>
      <p>
        The primary focus of the bibliometric analysis conducted within
this research was to identify the key research themes within the
fields of LA and EDM. Following preparation of the datasets,
the R package bibliometrix was used to further process the raw
BibTeX files, perform co-word analysis, and generate thematic
maps [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Key areas of research focus within each respective
dataset were identified as clusters formed from keywords
extracted from each dataset. Clusters are identified by co-word
analysis where keywords that occur frequently together within a
research domain are grouped together. Clustering also shows
subgroups of keywords that are linked to each other and the
degree of those relationships. Therefore, the final clusters
selected as representative samples for the three datasets were
those clusters that demonstrate a higher relative degree of density
and centrality when compared to other clusters. Centrality
represents a cluster’s relative interaction with other clusters, while
density represents the relative interaction of members within a
cluster.
      </p>
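      <p>A minimal sketch of how density and centrality might be computed for keyword clusters is given below. It assumes Callon-style measures with the equivalence index as the link weight, which is a common choice in co-word analysis but is an assumption here, since the exact formulas are not stated in the text. The documents and clusters are hypothetical toy data; in the study the clusters were produced by bibliometrix.</p>
      <preformat preformat-type="code">
```python
from itertools import combinations

# Hypothetical toy input: the keyword set attached to each document.
docs = [
    {"regression", "prediction", "students"},
    {"regression", "prediction"},
    {"curricula", "ontology"},
    {"curricula", "ontology", "students"},
    {"prediction", "curricula"},
]
# Hypothetical clusters, assumed to be given by a prior clustering step.
clusters = {
    "prediction": {"regression", "prediction"},
    "curriculum": {"curricula", "ontology", "students"},
}


def equivalence(a, b):
    """Equivalence index e_ab = c_ab^2 / (c_a * c_b) from co-occurrence counts."""
    c_a = sum(1 for d in docs if a in d)
    c_b = sum(1 for d in docs if b in d)
    c_ab = sum(1 for d in docs if a in d and b in d)
    return c_ab ** 2 / (c_a * c_b)


def density(name):
    """Internal cohesion: 100 * (sum of links inside the cluster) / cluster size."""
    members = sorted(clusters[name])
    internal = sum(equivalence(a, b) for a, b in combinations(members, 2))
    return 100 * internal / len(members)


def centrality(name):
    """External interaction: 10 * (sum of links to keywords in other clusters)."""
    inside = clusters[name]
    outside = set().union(*clusters.values()) - inside
    return 10 * sum(equivalence(a, b) for a in inside for b in outside)


for name in clusters:
    print(name, round(centrality(name), 3), round(density(name), 2))
```
      </preformat>
      <p>With only two clusters, every external link of one cluster is also an external link of the other, so both receive the same centrality; the clusters differ only in density.</p>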
      <p>Parameters were selected to maximize the number of clusters
generated from keywords within each of the three datasets. The
relative frequency of the occurrence of a keyword within a cluster
was set to three for all three collections: LA, EDM, and the
combined LA &amp; EDM dataset. The number of keywords was
varied from 0 to the maximum number of keywords in a
collection’s dataset in increments of 50. This process allowed
the selected parameters to be tuned so as to maximize the
number of keyword clusters in each collection.</p>
    </sec>
    <sec id="sec-6">
      <title>3.4 Theme Visualization</title>
      <p>
        Resulting themes were then mapped to a visualization
demonstrated by Cobo et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], namely the thematic map.
Clusters are positioned relative to one another, according to their
density and centrality, in the regions of the diagram in Figure 3. Once
plotted, the largest clusters located in Quadrant I (quadrants are numbered
clockwise), the Highly Developed, Motor Themes quadrant,
were selected as the representative themes of the subject area
domain. These themes have high density and high centrality
and are the fundamental themes of the field.
      </p>
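      <p>The placement of clusters into quadrants can be sketched as follows. The cluster scores are hypothetical, and splitting the map at the median centrality and median density is an assumption made for illustration. Quadrants are numbered clockwise from the upper right, as in the text; the label given to Quadrant II follows the naming commonly used for thematic maps and is not stated in this article.</p>
      <preformat preformat-type="code">
```python
from statistics import median

# Hypothetical (centrality, density) scores for four clusters.
scores = {
    "prediction": (8.0, 40.0),
    "tutoring": (2.0, 35.0),
    "surveys": (1.0, 5.0),
    "assessment": (9.0, 10.0),
}


def quadrant(name):
    """Place a cluster in a quadrant, numbered clockwise from the upper right."""
    mid_c = median(c for c, _ in scores.values())
    mid_d = median(d for _, d in scores.values())
    central = scores[name][0] >= mid_c
    dense = scores[name][1] >= mid_d
    if central and dense:
        return "I: motor themes"
    if central:
        return "II: basic and transversal themes"
    if dense:
        return "IV: highly developed and isolated themes"
    return "III: emerging or declining themes"


for name in scores:
    print(name, "->", quadrant(name))
```
      </preformat>
      <p>Only clusters that fall in Quadrant I would be carried forward as representative themes under the selection rule described above.</p>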
    </sec>
    <sec id="sec-7">
      <title>3.5 Item Selection</title>
      <p>
        Following the selection of the clusters as the major themes for this
collection of documents within each domain, a “bag of words
model” was used to help select items that should be considered as
most influential for the field [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A universal vector was created
from all possible keywords in the collection of articles for each
dataset. Subsequently, each thematic cluster and each
document was converted to a vector by one-hot
encoding the keywords within the cluster or the keywords
associated with the document. Each document was compared to the
thematic cluster, and a cosine similarity score was generated to
assess how similar the document was to the cluster.
      </p>
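      <p>The encoding and comparison steps described above can be sketched as follows. The cluster, the document keyword sets, and all names in the code are hypothetical; the sketch assumes binary vectors over a shared vocabulary and is not the implementation used in the study.</p>
      <preformat preformat-type="code">
```python
import math

# Hypothetical toy data: one thematic cluster and the keywords of three documents.
cluster = {"regression analysis", "linear regression", "predictive modeling"}
documents = {
    "doc1": {"regression analysis", "linear regression", "students"},
    "doc2": {"predictive modeling", "linear regression", "regression analysis"},
    "doc3": {"curricula", "ontology"},
}

# Universal vocabulary: every keyword that appears anywhere in the collection.
vocabulary = sorted(set(cluster).union(*documents.values()))


def encode(keywords):
    """Binary vector over the universal vocabulary (1 when the keyword is present)."""
    return [1 if term in keywords else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


cluster_vector = encode(cluster)
scores = {name: cosine(encode(kw), cluster_vector) for name, kw in documents.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```
      </preformat>
      <p>A document whose keywords coincide with the cluster scores close to 1, while a document that shares no keywords with the cluster scores 0.</p>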
      <p>
        To select the representative sample of items within a thematic
cluster, a combination of cosine similarity scores and total
citations per item was utilized. Items with the highest cosine
similarity score were selected within each cluster and represent
those in the “core zone” described by Bradford’s law [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Bradford’s law serves as a good rule of thumb for describing the
exponentially diminishing returns of retrieving articles for a given
domain within a database. The three zones of Bradford’s law are
shown in Figure 4.
      </p>
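      <p>For orientation, the classic formulation of Bradford’s zones can be sketched as follows: sources are ranked by productivity and divided into three zones that each hold roughly one third of the articles. The counts are hypothetical, and the sketch illustrates the general law only; how the zones were adapted to cosine similarity scores in this study is not specified in the text.</p>
      <preformat preformat-type="code">
```python
def bradford_zones(counts):
    """Split sources into three zones holding roughly equal shares of the articles."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(counts.values())
    zones = ([], [], [])
    running = 0
    for source, n in ranked:
        # The zone is decided by the cumulative share reached before this source.
        index = min(2, (3 * running) // total)
        zones[index].append(source)
        running += n
    return zones


# Hypothetical article counts per source.
counts = {"s1": 30, "s2": 15, "s3": 15, "s4": 6, "s5": 6, "s6": 6, "s7": 6, "s8": 6}
core, middle, outer = bradford_zones(counts)
print(len(core), len(middle), len(outer))
```
      </preformat>
      <p>In this toy example each zone holds 30 articles, yet the number of sources needed grows from one to two to five, which is the diminishing-returns pattern the law describes.</p>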
    </sec>
    <sec id="sec-8">
      <title>4. RESULTS</title>
      <p>Six representative keyword clusters were extracted from the three
datasets, two from each dataset. The keywords that represent each
selected cluster are listed in Tables 3–5 and
grouped into one of four categories. ‘Analysis/Tools’
captures keywords related to the detailed examination or
processing of inputs or the application of specific devices,
software, or methods to perform particular analysis functions.
‘Context (Environment)’ features keywords related to the
conditions that influence the setting where work is performed,
while ‘Context (Target Group)’ encompasses keywords related to
specific individuals, groups, or levels. The final category,
‘Teaching/Learning’, features keywords related to the act
of teaching (providing instruction to learners) or learning
(acquiring knowledge and skills by studying, experiencing, or
being taught).</p>
      <p>Several keywords are present across multiple datasets. For
example, the keyword ‘education computing’ is shared between
the LA dataset and the combined LA &amp; EDM dataset. The
keyword ‘student performance’ is shared between the
EDM dataset and the combined LA &amp; EDM dataset.</p>
      <p>Also noteworthy, keywords within the two LA clusters focus
primarily on instruction and communication. Within the first
cluster, keywords such as “curricula”, “mobile learning”,
“ontology”, and “conceptual frameworks” dominate. The
second LA cluster, however, focuses on keywords such as “natural
language processing”, “computational linguistics”, and
“linguistics”. Other notable terms within this cluster include
“students’ behaviors” and “knowledge building.”</p>
      <p>Within the EDM clusters, the focus appears to be more on
student performance and the technical details of methods for
predicting performance. Within the first cluster, keywords such as
“recommender systems”, “cognitive tutors”, and “factorization”
appear. The second cluster reinforces many of these
same themes through keywords such as “association rules”, “supervised
learning”, and “decision support systems”.
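<table-wrap>
  <label>Table 3</label>
  <caption>
    <p>Keywords in the selected EDM clusters, by category.</p>
  </caption>
  <table>
    <thead>
      <tr><th>Category</th><th>Keywords</th></tr>
    </thead>
    <tbody>
      <tr><td>Analysis/Tools</td><td>algorithms, association rules, classifiers, data sets, factorization, information management, matrix algebra, matrix factorizations, recommender systems, supervised learning</td></tr>
      <tr><td>Context (Environment)</td><td>cognitive tutors, decision support systems, knowledge management, learning systems</td></tr>
      <tr><td>Context (Target Group)</td><td>student models, students, students' performance, student's performance, university students</td></tr>
      <tr><td>Teaching/Learning</td><td>N/A</td></tr>
    </tbody>
  </table>
</table-wrap>
<table-wrap>
  <label>Table 4</label>
  <caption>
    <p>Keywords in the selected LA clusters, by category.</p>
  </caption>
  <table>
    <thead>
      <tr><th>Category</th><th>Keywords</th></tr>
    </thead>
    <tbody>
      <tr><td>Analysis/Tools</td><td>computational linguistics, natural language processing, natural language processing systems, statistics</td></tr>
      <tr><td>Context (Environment)</td><td>computer-aided instruction, information systems, mobile applications, mobile learning, ubiquitous learning</td></tr>
      <tr><td>Context (Target Group)</td><td>students' behaviors</td></tr>
      <tr><td>Teaching/Learning</td><td>assessment, conceptual frameworks, curricula, design, education computing, information science, knowledge building, learning dispositions, linguistics, ontology</td></tr>
    </tbody>
  </table>
</table-wrap>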
Keywords were further divided into four categories:
analysis/tools, context (environment), context (target group), and
teaching/learning. The category context (target group) comprises keywords
related to specific individuals, groups, or levels. Context
(environment) comprises keywords related to the conditions that influence
the setting where learning or academic work is performed. The
category teaching/learning comprises keywords related to
the act of teaching (e.g. providing instruction to learners) or
learning (e.g. acquiring knowledge and skills by studying,
experiencing, or being taught). Finally, the category
analysis/tools comprises keywords related to the detailed examination or
processing of inputs or the application of specific devices,
software, or methods to perform particular analytic functions. The
categories context (target group) and analysis/tools are more
closely aligned with the definition of Educational Data Mining,
while the categories context (environment) and teaching/learning
are more closely related to Learning Analytics. In Table 3, 14 of
the 19 keywords extracted from the EDM dataset are associated
with EDM categories. In Table 4, 11 of the 20 keywords extracted
from the LA dataset are associated with LA categories. In Table 5, 10
of the 13 keywords extracted from the combined LA &amp; EDM dataset
are associated with EDM categories. Although the keywords in
the EDM and combined LA &amp; EDM datasets differ, when viewed
through the framework of these broader categories,
the EDM and LA &amp; EDM datasets share more similarities
than differences.</p>
      <p>Ultimately, 43 items were identified in the core zone for each
cluster. After the items were ranked by influence (total
citations), the top item from each cluster was identified and
listed in Table 6.</p>
      <p>Titles recovered from Table 6: “performance prediction model
through interpretable genetic programming: integrating learning
analytics, educational data mining and theory”; “MOOCS: So many learners, so
much potential”.</p>
    </sec>
    <sec id="sec-9">
      <title>5. CONCLUSION</title>
      <p>Following keyword cluster extraction, it becomes clear that major
research themes within LA focus primarily on student-focused
learning objectives. Keywords such as “curricula”, “students’
behaviors”, and “knowledge building” suggest a focus on using
technology to help students learn and understand how they learn.
There is also emphasis on words within the natural language
learning domain such as “linguistics”, “computational
linguistics”, and “natural language processing” that suggest an
emphasis on bridging the gap between human and computer
interaction through natural language processing.</p>
      <p>The keyword clusters for EDM suggest a focus on the algorithms
behind predicting student performance. Keywords such as
“student performance”, “cognitive tutors”, “student models”, and
“learning algorithms” are aligned with this focus. Other keywords
such as “recommender systems” and “performance classifiers”
further suggest EDM’s focus on predicting student preferences
based on their performance.</p>
      <p>Within the joint LA &amp; EDM dataset, keywords such as “regression
analysis”, “linear regression”, and “predictive modeling” highlight
common algorithms within both domains. This focus on
algorithms within the intersection of LA &amp; EDM is similar to the
focus and direction of major themes in EDM. With this
consideration, it appears that EDM can be considered a subset of
LA. What appeared to be two domains with a significant amount
of overlap can be better described as one domain (i.e. LA) with
one prominent subset (i.e. EDM). Figure 5 depicts both
relationships.</p>
    </sec>
    <sec id="sec-10">
      <title>6. FUTURE WORK</title>
      <p>Managing bias is a concern when conducting literature reviews,
and a bibliometric approach helps to mitigate
such risks. Future work could include literature reviews based on
the items chosen by this bibliometric approach, or adaptation of the
approach to other topics. Further information about the
distinction between LA and EDM based on research production
can be gleaned from the preliminary findings herein. Also, while
the technique outlined in this paper focuses on identifying key
concepts within Quadrant I of the thematic map, it could be
modified to explore other quadrants that focus on emerging
themes or isolated themes, i.e. Quadrants III and IV respectively.
The identification of promising research areas can be beneficial
for active researchers. Furthermore, the evolution
of keyword themes over time could be explored by dividing the same data
into consecutive groups of years.
Finally, more advanced natural language techniques, such as
word2vec models, could be utilized to generate more robust results
overall.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bornmann</surname>
          </string-name>
          , “
          <article-title>Do altmetrics point to the broader impact of research? An overview of benef</article-title>
          ...,”
          <source>J. Informetr.</source>
          , vol.
          <volume>8</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hurter</surname>
          </string-name>
          , “
          <article-title>Analysis and Visualization of Citation Networks</article-title>
          ,” Synth. Lect. Vis., vol.
          <volume>3</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Batagelj</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Cerinšek</surname>
          </string-name>
          , “
          <article-title>On bibliographic networks</article-title>
          ,”
          <source>Scientometrics</source>
          , vol.
          <volume>96</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>845</fpage>
          -
          <lpage>864</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cobo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>López-Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Herrera-Viedma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Herrera</surname>
          </string-name>
          , “
          <article-title>An approach for detecting, quantifying, and visualizing the evolution of a research field: A practical application to the Fuzzy Sets Theory field</article-title>
          ,” J. Informetr., vol.
          <volume>5</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>166</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fenn</surname>
          </string-name>
          , “
          <article-title>Managing Citations and Your Bibliography with BibTeX</article-title>
          ,”
          <source>Pr. J.</source>
          , vol.
          <volume>1</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aria</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Cuccurullo</surname>
          </string-name>
          , “
          <article-title>bibliometrix: An R-tool for comprehensive science mapping analysis</article-title>
          ,”
          <source>J. Informetr.</source>
          , vol.
          <volume>11</volume>
          , no.
          <issue>4</issue>
          , pp.
          <fpage>959</fpage>
          -
          <lpage>975</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Mao</surname>
          </string-name>
          , “
          <article-title>Fuzzy Bag-of-Words Model for Document</article-title>
          ,”
          <source>IEEE Trans. Fuzzy Syst.</source>
          , vol.
          <volume>26</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>794</fpage>
          -
          <lpage>804</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Veras</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gosain</surname>
          </string-name>
          , “
          <article-title>Using Bradford’s law of scattering to identify the core journals of pediatric surgery</article-title>
          ,”
          <source>J. Surg. Res.</source>
          , vol.
          <volume>229</volume>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>