         Semantic Textual Similarity of Course Materials at a
                   Distance-Learning University

Niels Seidel, Moritz Rieger, Tobias Walle
FernUniversität in Hagen
niels.seidel@fernuni-hagen.de, moritz.rieger@posteo.de, tobias.walle@student.fernuni-hagen.de


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
Choosing computer science courses from a wide range of courses is not an easy task for students, especially in the first semesters. To overcome the shortcomings of course descriptions and vague recommendations by acquaintances, we provide a method to identify and visualize semantic similarities between courses using textual learning materials. To achieve this goal, a complete set of course materials (94 courses, 572 course units / PDF textbooks) from the Faculty of Mathematics and Computer Science at FernUniversität in Hagen was vectorized as document embeddings and then compared using the cosine similarity of the vectors. The process can be fully automated and does not require labeled data. The results were compared with the semantic similarity assessed by domain experts. In addition, the similarity of consecutive courses and of sections within the same course has been evaluated against the average similarity of all courses. The presented approach has been integrated into a course recommendation system, a course dashboard for teachers and a component of an adaptive learning environment.

Keywords
NLP, Semantic Textual Similarity, Document Embedding, Educational Data Mining

1. INTRODUCTION
Before each semester, students are faced with the question of which course to take. In order to achieve the goal within a course of study, the examination regulations contain information on the optional and compulsory modules and courses. Study plans of the Student Advisory Service flank this framework with recommendations on the number, sequence and selection of courses for the individual semesters. Ultimately, the dates of the courses result in further organizational requirements with which the individual timetable must be brought into line. Despite these organizational restrictions, the internal autonomy of the universities opens up many options for selecting courses according to content criteria and interests. However, only module handbooks and course websites are usually available for decision-making purposes. Learning materials published in advance as textbooks or as OER are the exception. In both cases, however, the amount of information is difficult to manage. The linear format of the module handbooks, which often contain more than one hundred pages, makes it difficult to identify courses that are similar in content and build on each other. Moreover, the concise descriptions of the modules represent only a fraction of the learning content. Courses that are not assigned to the course of study will, of course, not appear in the module handbook. Many students therefore seek advice from friends and fellow students or follow recommendations from teachers. However, prospective students and first-year students do not yet have these contacts. This challenge becomes particularly clear when looking at the example of the FernUniversität in Hagen. With over 74,000 registered students and a course offering of over 1,600 courses, the distance-learning university is the largest university in Germany. The Faculty of Mathematics and Computer Science alone accounts for 134 offerings, of which 94 are courses and 40 are seminars or internships. For students at the faculty, choosing from this large number of courses is a particular challenge. In contrast to attendance universities, it is usually not possible to benefit from the experience of fellow students. Furthermore, the authors or supervisors are usually not personally known to the students, so that contacts with lecturers can hardly make the decision easier. Students can use the short descriptions of course contents and learning goals in the module handbooks (approx. 100-200 words) as well as short readings of one chapter of the script for decision-making. For universities with a very large number of courses and a very wide range of options, the planning of the study program is therefore time-consuming and complex.

Teachers who wish to avoid redundancies with other courses and who want to build on previous knowledge, or develop it for other courses, when planning and creating learning materials face a similar hurdle. In view of the large number of courses, however, the people concerned do not always know exactly what their colleagues teach in detail in their courses. Consequently, overlaps and gaps in content remain undetected and potential for cooperation in the field of teaching is not recognized.

In this paper a method for the analysis of the semantic similarity of courses using text-based learning materials is presented.
In the second section, related works regarding methods for determining the semantic similarity of texts as well as course selection recommendation systems are presented. Subsequently, the document embedding method used here for the analysis of semantic similarities is presented in section 3 using the example of a corpus of 94 courses of the FernUniversität in Hagen. The results are evaluated in section 4. Based on the semantic relations of the course materials, we present three prototypical applications in section 5: i) a tool for the exploration and recommendation of courses, ii) a teacher dashboard, and iii) adaptive course recommendations for long study texts. The article ends with a summary and an outlook.

2. RELATED WORKS
The processing of natural language using Natural Language Processing (NLP) techniques has made enormous progress in recent years. Conventional NLP methods generate vectors from a text document using Bag of Words (BOW), frequency-based methods like Term Frequency Inverse Document Frequency (TF-IDF), Latent Dirichlet Allocation (LDA) etc. and calculate the distance between the vectors [19]. However, these methods cannot capture the semantic distance or are very computationally intensive [21] and usually do not achieve good results. Newer machine learning methods achieve much better results in the analysis of semantic text representations [10]. A central challenge is the determination of Semantic Textual Similarity (STS), which is used in machine translation, semantic search, question answering and chatbots. With the help of developments in the field of distributed representations, especially neural networks and word embeddings such as Word2Vec [21] and GloVe [23], semantic properties of words can be mapped into vectors. Le and Mikolov have shown with Doc2Vec that the principles used can also be applied to documents [17].

The similarity of extensive book collections has so far been investigated in only a few works. The Skip-Thought Vectors presented by Kiros et al. train an encoder-decoder model that attempts to reconstruct the surrounding sentences of an encoded passage [16]. However, the experiments were based on a relatively small corpus of only 11 books [30]. Spasojevic and Poncin, on the other hand, determined the semantic similarity at the level of individual pages and entire books for the corpus of Google Books, which contains about 15 million books [26]. The similarity of two books was determined from the Jaccard index of the permuted hash values of normalized word groups (features). Liu et al., however, point out that the semantic structure of longer documents cannot be taken into account in this way and therefore propose the representation as a Concept Interaction Graph [20]. Keywords are determined from a pair of documents and combined into concepts (nodes) using community detection algorithms. These nodes are connected by edges that represent the interactions between the nodes based on sentences from the documents. Although the method seems very promising, it has so far only been investigated on the basis of news articles. SemEval-2018 Task 7 [12] pursues a similar goal as this paper with regard to STS, where semantic relations are to be found in abstracts of scientific articles. The gold standard used for its evaluation is based on named entities (persons, places, organizations), which cannot be annotated with reasonable effort for large amounts of text.

Brackhage et al. had experts manually keyword module descriptions of several universities, visualized these data together with further metadata in a web application as a force-layout graph and an adjacency matrix heatmap, and made them searchable with the help of complex filters [7]. However, keywording proved to be extremely time-consuming and has to be updated frequently. Baumann, Endraß and Alezard used study history data to visualize "on the one hand the distribution of students across the modules in a study program and on the other hand the distribution of students in a module across different study programs" [4], without, however, concretizing their benefit for the intended support in the choice of courses. Askinadze and Conrad used examination data from one study program to illustrate the progress and discontinuation of studies in various visualisations [3]. However, there is a large number of applications and approaches to recommend courses to students. Lin et al. used the sparse linear method to develop top-N recommendations based on occupancy data of specific groups of students [19]. With the help of K-Means and Apriori association rules, Aher and Lobo presented a recommendation system for courses in the learning management system Moodle [2]. Zablith et al. present several recommendation systems based on linked data from the Open University UK (see http://data.open.ac.uk/, last accessed 15.06.2020) [29]. The Social Study application, for example, suggests courses to learners based on their Facebook profile, while Linked OpenLearn offers learners media and courses related to the distance-learning university's OER. The recommendations are based on course-related metadata and links to other courses and media, but do not consider the semantics of the courses. D'Aquin and Jay try to reconstruct the missing semantics with the help of different linked data sources (e.g. DBpedia) in order to trace frequently occurring course occupancies (frequent sequences) [11]. The analysis of semantic similarities on the basis of the complete learning materials does not only provide insights into the content relations of learning resources but also opens up the possibility to understand the temporal structure and patterns of course assignments for the decision-making process when choosing a course.

From the perspective of course planning, Kardan et al. have developed a prediction model for the number of course bookings with the help of a neural network [15]. Ognjanovic et al. have also modeled the course occupancy for several semesters in advance [22]. However, the authors of this paper could not find any contributions in the literature for a didactically motivated use of occupancy statistics. The same applies to the use of these data for the modeling of learners within adaptive or at least personalized learning environments.

3. DETERMINATION OF THE SEMANTIC SIMILARITY OF TEXTS
In this section, a procedure for analyzing the semantic similarity of courses using text-based learning materials is presented using the example of the study texts of the Faculty of Mathematics and Computer Science of the FernUniversität in Hagen. This procedure consists of four steps, which are mainly based on the work of [blinded]. First, a corpus of course materials is created. Then these data are vectorized in order to determine the similarity in the third step. Finally, an evaluation with a gold standard and other comparison parameters is carried out.
A corpus is a collection of related documents. In order to create a corpus, source data of 94 courses from all 20 subject areas of the faculty were available. A course consists of 3 to 10 documents, which we call course units or units. The course units were available as PDF documents with between 20 and 60 pages each. The PDFs differed in terms of their format (e.g. PDF/A, PDF/X), the PDF versions and the tools used to create them. The formatting of the type area was also not uniform. For these reasons, a programmatic extraction of chapters using regular expressions and PDF outlines proved to be unreliable and had to be discarded. The cover pages as well as redundant tables of contents and keyword indexes within a course were removed. The PDF documents were therefore first converted to text and divided into sentences and words using NLTK [5]. To avoid errors with mathematical formulas and dotted lines in the table of contents, the text was also normalized. The resulting corpus contains 572 course units, consisting of 654,367 sentences with a total of 9,507,770 words. The vocabulary contains 179,078 different words.
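The corpus construction can be sketched as follows. This is a minimal sketch under several assumptions: the PDFs are taken to be already converted to plain text files (one file per course unit, with cover pages and indexes removed), and the directory layout, the normalization rules and the use of the German tokenizer models are illustrative; only the sentence and word splitting with NLTK [5] is taken from the text.

import re
from pathlib import Path
from nltk.tokenize import sent_tokenize, word_tokenize  # NLTK [5]

def normalize(text: str) -> str:
    """Remove artifacts such as dotted leader lines from tables of contents
    and collapse whitespace. These rules are assumptions, not the authors' code."""
    text = re.sub(r"\.{3,}", " ", text)   # dotted lines such as "........ 23"
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def build_corpus(text_dir: str) -> dict:
    """Return a mapping from course unit id (file name) to a list of
    tokenized sentences for that unit."""
    corpus = {}
    for path in sorted(Path(text_dir).glob("*.txt")):
        raw = normalize(path.read_text(encoding="utf-8"))
        sentences = [word_tokenize(s, language="german")
                     for s in sent_tokenize(raw, language="german")]
        corpus[path.stem] = sentences
    return corpus

# corpus = build_corpus("course_units_txt")   # 572 units in the case described here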
Document embeddings, also called Paragraph Vectors (PV) by Le and Mikolov, were used to vectorize the documents [17]. Since document embeddings are based on word embeddings, the latter must be created first. For this purpose, the words enter a neural network in one-hot encoding. The network serves an estimation task, whereby the word most likely to occur in the context of a given word is to be estimated. The neural network is trained with tuples from the text. For this purpose, a window is pushed through the entire corpus and the resulting tuple combinations within the window are noted. By feeding the estimation error back into the re-estimation, the weights of the weight matrix W are optimized. As a consequence, the weights of the estimation task for semantically close words assume similar values, since comparable tuples were used in the training. The weights of the estimation task represent the word embeddings.
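To make the estimation task concrete, the following sketch generates the (context, target) tuples that result from pushing a window through a tokenized text; the window size and the example sentence are illustrative assumptions.

def context_target_pairs(tokens, window=2):
    """Yield (context words, target word) tuples for a CBOW-style estimation
    task: the target word is predicted from the words inside the window."""
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            yield context, target

# Example:
# list(context_target_pairs("the course covers data structures".split()))
# -> [(['course', 'covers'], 'the'), (['the', 'covers', 'data'], 'course'), ...]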
Figure 1: Continuous Bag-of-Words model as well as the paragraph ID (orange), which is included in the estimate of w(t) in addition to the context words for document embeddings.

In order to be able to represent whole documents semantically as vectors, the idea of word embeddings is extended to whole texts. For this purpose, a paragraph vector, a column of another weight matrix D, is combined with the word vectors to estimate the next word from a given context (see Fig. 1). Since the word vectors capture the semantics of the words as an indirect consequence of the estimation task, this happens in a similar way with document embeddings. One can imagine the training of the PV as the training of another word. A PV acts as a kind of memory that contains information about missing words in the context within a document. For this reason this model is also called the Distributed Memory Model of PV. Building on the word embeddings, PVs are trained to represent entire documents. The PVs can then be processed as characteristics of the documents in order to recognize semantic similarities between the documents. Before the documents are compared with each other, the semantic similarity of texts is first examined in general. To find a commonality of all terms, similarity has to be thought of as a "complex network of similarities" [28] of different entities. This complex form of similarity of natural language means that two documents cannot be considered semantically similar on the basis of common features, but that similarity is to be understood as the interaction of many direct and indirect relationships between the words contained in them [13]. This concept of similarity is taken into account in the training of word embeddings. The weight matrix, which ultimately contains the word embeddings, is the result of the use of the words in all contexts of the entire text corpus and thus represents the complex network of similarities described by Wittgenstein. To compare distributed representations, the cosine similarity is usually used as a metric [27, 13]. For normalized PV there is a linear relationship to the Euclidean distance.
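A minimal sketch of the vectorization and comparison step, assuming the gensim implementation of Doc2Vec [17, 24] in PV-DM mode (gensim 4.x); the hyperparameter values are those reported as optimal in section 4, while the unit identifiers and all remaining settings are assumptions.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import numpy as np

# `corpus` maps a course unit id to its tokenized sentences (see the corpus sketch above).
documents = [TaggedDocument(words=[w for sent in sents for w in sent], tags=[unit_id])
             for unit_id, sents in corpus.items()]

model = Doc2Vec(documents,
                dm=1,             # Distributed Memory Model of PV (PV-DM)
                vector_size=140,  # PV dimension (best value found in section 4)
                window=20,        # context window (best value found in section 4)
                min_count=20,     # minimum word frequency (best value found in section 4)
                epochs=20, workers=4)

def cosine_similarity(a, b):
    """Cosine similarity a.b / (|a||b|) between two paragraph vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Equivalent shortcut in gensim: model.dv.similarity("unit_a", "unit_b")
sim = cosine_similarity(model.dv["unit_a"], model.dv["unit_b"])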
4. EVALUATION OF NLP SYSTEMS
Since vectorizing the documents as PV is an unsupervised machine learning method, there is no underlying test data against which the system can be tested. Following the SemEval competitions, a gold standard was therefore developed, which consists of a set of test and training data. However, this gold standard could not be generated by crowdsourcing [1], as is the case with many SemEval tasks, since a high degree of competence in the respective fields of knowledge is required for the assessment of semantic similarity. For this reason, three experts, who are authors of course texts themselves, were asked to compare one of their own courses with a course that they considered to be similar. In doing so, they selected 6 unique courses. By evaluating documents that are related in terms of topic or content, monotonous gold standards that do not show any similarities could be avoided. Each of the three experts evaluated two courses, which consisted of 4 and 7 units each; each evaluator thus made 28 comparisons. The similarity was indicated on a continuous scale from 0 (not similar) to 100 (identical). A nominal gradation of the scale was omitted due to expected problems of understanding with regard to the valence and equidistance of the scale values. Half of the gold standard data was used for training different hyperparameters, as shown in Fig. 3. The hyperparameters comprised the window size of the Continuous Bag of Words, the dimension of the PV and the minimum frequency of occurrence of the words considered. The values for the individual parameters are based on plausibility tests and are within the value ranges known from the literature (e.g. [21]). To avoid overfitting, the values for each parameter were only roughly graded. The minimum mean square error was obtained for a window size of 20, a PV dimension of 140 and a value of 20 as the lower limit for the frequency of occurrence of words (see the solid blue dot in Fig. 3). Based on these hyperparameters a model was trained and tested with the second part of the gold standard (test data). Pearson's r as a measure of the linear relationship reached a value of 0.598. However, since in the present case whole documents were compared instead of individual sentences, the gold standard and the PV are fuzzier. Fine fluctuations in cosine similarity are not reflected in the gold standard. However, in view of the subjective assessment on the continuous scale, which can be freely interpreted by the evaluators, the value for Pearson's r must be regarded as high. To establish the monotonic relationship between cosine similarity and the gold standard, Kendall's τ was determined as a rank correlation coefficient, with a value of 0.451. In general, smaller values of correlation are obtained for Kendall's τ than for Pearson's r. However, the low value is also due to the individual definition of the concept of similarity and the individual mapping of the subjectively perceived similarity to the scale. Looking at the areas of high similarity shown in Fig. 4, the correlation is more obvious.

Figure 2: Adjacency matrix of course unit relations. Courses are represented by a running number along the axis. The darker the boxes, the greater the semantic similarity. The dark colored rectangular artifacts along the diagonals indicate the high similarity of units of the same course.

In addition to the gold standard, the NLP system was checked for the plausibility of its results. Two hypotheses were put forward for this purpose:

H1 Course units of consecutive courses are more similar than units of other courses.

H2 Course units of one course are more similar than units of other courses.

In order to test the first hypothesis, eight courses were initially identified which, given the numbering contained in the course title, clearly build on each other. The mean cosine similarity of the consecutive courses is 0.32, which is above the average of the whole corpus (0.18). Hypothesis 1 is thus confirmed. The second hypothesis could already be recognized by the strongly colored rectangular artifacts along the diagonals in the adjacency matrix in Fig. 2. The mean similarity of units within the same course is 0.51 and is thus significantly greater than the mean cosine similarity of the whole corpus (see Fig. 5). Hypothesis 2 is therefore also confirmed. A further part of the plausibility check consisted, among other things, in excluding undesired effects of the document size on the semantic similarity. There is no correlation between the difference in the word count of two documents and their cosine similarity (r = 0.013).
Figure 3: Minimizing the mean square error for multiple configurations of hyperparameters. Each dot represents a hyperparameter configuration. The highlighted dot in solid blue represents the best parameter combination.

Figure 4: Ratio of gold standard (orange) to cosine similarity (blue) for the individual test pairs.

Figure 5: Distribution and mean value of the cosine similarity in the entire corpus (blue), between the consecutive courses (orange), and the course units (green).
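The evaluation summarized in section 4 and illustrated in Figures 3-5 can be reproduced roughly along the following lines. This is a sketch under several assumptions: the gold standard is represented as (unit id, unit id, rating 0-100) triples, a hypothetical helper train_doc2vec wraps the training shown earlier, scipy is assumed for the correlation coefficients, and the candidate grids and variable names are illustrative.

from itertools import product
from scipy.stats import pearsonr, kendalltau

def mse(model, pairs):
    """Mean square error between predicted cosine similarity and expert rating."""
    errors = [(model.dv.similarity(a, b) - rating / 100.0) ** 2 for a, b, rating in pairs]
    return sum(errors) / len(errors)

def grid_search(train_pairs, windows=(5, 10, 20), dims=(100, 140, 200), min_counts=(5, 20, 50)):
    """Roughly graded grid as in Fig. 3; returns the model with the lowest MSE."""
    best = None
    for window, dim, min_count in product(windows, dims, min_counts):
        model = train_doc2vec(window=window, vector_size=dim, min_count=min_count)  # hypothetical helper
        score = mse(model, train_pairs)
        if best is None or score < best[0]:
            best = (score, model)
    return best[1]

def correlations(model, test_pairs):
    """Pearson's r and Kendall's tau between cosine similarity and the gold standard."""
    predicted = [float(model.dv.similarity(a, b)) for a, b, _ in test_pairs]
    observed = [rating for _, _, rating in test_pairs]
    return pearsonr(predicted, observed)[0], kendalltau(predicted, observed)[0]

def mean_similarity(model, pairs):
    """Plausibility checks H1/H2: mean cosine similarity over a set of unit pairs."""
    sims = [float(model.dv.similarity(a, b)) for a, b in pairs]
    return sum(sims) / len(sims)

With such helpers, the mean similarity of consecutive-course pairs and of within-course pairs can be compared directly against the corpus average in order to check H1 and H2.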
5. APPLICATIONS
5.1 Course exploration and recommendation
The hurdles in the choice of courses described in the introduction to this paper are addressed by an application in which learners can explore the semantic similarity of courses and course units by means of visualizations in the form of chord diagrams, force-layout graphs and heat maps. Node-link diagrams are primarily suitable for small graphs, since the visualization quickly becomes confusing due to overlapping edges. Heatmaps in particular may contain many nodes, but require a lot of space, and their readability depends largely on the arrangement of the elements. Due to this limitation, it seemed necessary to realize the exploration over the entire set of courses not graphically but textually. Besides the given structuring of the courses according to study programs, chairs and lecturers, we tried to identify overlapping topics. Using Latent Dirichlet Allocation, 11 topics were determined based on the word distribution [24]. For each topic the 20 most weighted terms were displayed in a word cloud. After the user has made a pre-selection (e.g. by choosing a topic), a limited set of up to 20 courses including their course units can be explored. For this purpose various interactive node-link diagrams were created as Data-Driven Documents [6].
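The topic pre-selection can be sketched with gensim's LDA implementation [24]; the number of topics (11) and the 20 terms per topic are taken from the text, whereas the dictionary handling, the number of passes and the variable names are assumptions.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# `unit_tokens`: one flat token list per course unit, e.g. the flattened values
# of the corpus dictionary built in the corpus sketch above.
dictionary = Dictionary(unit_tokens)
bow_corpus = [dictionary.doc2bow(tokens) for tokens in unit_tokens]

lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=11,
               passes=10, random_state=42)

# The 20 most weighted terms per topic, e.g. as input for a word cloud.
for topic_id in range(lda.num_topics):
    terms = [term for term, _ in lda.show_topic(topic_id, topn=20)]
    print(topic_id, ", ".join(terms))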
The recommendation of courses is based on two approaches. Firstly, other courses with a high cosine similarity were proposed for a course. The suggestions were justified by a list of the particularly similar course units (see Fig. 6). In this way, the algorithmic decision can be understood on the basis of the available texts.

Secondly, the Alternating Least Squares algorithm by Hu, Koren, and Volinsky [14] was used for collaborative filtering in order to create a recommendation system based on the courses that other students have enrolled in in the past. Collaborative filtering often works with explicit feedback based on user ratings. Course enrollment data, however, does not express an assessment but a learner's preference, which is called implicit feedback. By choosing a course, a student indirectly expresses his or her preferences. Students who have taken similar courses may be interested in similar courses in the future. The numerical result of the implicit feedback indicates the confidence, but not the student's preference for a course. The user behavior can be used to deduce which courses the user is likely to prefer. Fig. 7 shows a screen grab of the recommender system.
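A minimal sketch of this collaborative-filtering step, assuming the open-source implicit library (version 0.5 or newer), which provides an implementation of the ALS algorithm of Hu, Koren and Volinsky [14]; the toy enrollment matrix and all parameter values are illustrative assumptions, not the configuration used here.

import numpy as np
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Rows = students, columns = courses; a 1 marks an enrollment (implicit feedback).
enrollments = sp.csr_matrix(np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 1],
], dtype=np.float32))

model = AlternatingLeastSquares(factors=32, regularization=0.01, iterations=15)
model.fit(enrollments)  # a confidence weighting (e.g. alpha * enrollments) could be applied first

# Recommend two courses for student 0, excluding courses already taken.
ids, scores = model.recommend(0, enrollments[0], N=2, filter_already_liked_items=True)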
The filtering procedure described here only briefly has clear limits. For example, the order in which courses are taken is not considered. However, this can be highly relevant, as a student should not be recommended to take any more basic courses at the end of his or her studies. The method also always interprets the attendance of a course as a positive factor. This is not always the case, for example because a student attends a course but has not perceived it as interesting or valuable. Furthermore, there are compulsory modules in many courses of study, which must be attended in any case. However, this is a general disadvantage of recommendations based on implicit feedback. The chosen approach of collaborative filtering cannot make recommendations for prospective students who have not yet taken a course. In this case, however, the usual introductory courses of a degree program can be recommended. Apart from the examination requirements of certain subjects, the course choice is not constrained by study regulations or other prerequisites at our faculty. Such constraints might have to be considered for other course recommender systems.

Figure 6: Course details view with a list of related courses.

Figure 7: Course recommendations based on the individual course of study and the data on the enrollment of all students in the study program.

5.2 Teacher dashboard
The second application scenario is primarily aimed at teachers and authors of learning materials. In a Learning Analytics Dashboard [25], course occupancy statistics are linked with the semantic relations of the course materials. By including the semantic textual similarity of other courses and course chapters, responsible teachers can identify connections to other courses and possible content duplications. The dashboard consists of six tiles in a three-column layout: (1) An adjacency matrix shows the similarity of the course units contained in the course (Fig. 8, left). (2) The five most similar courses are shown in a matrix (Fig. 8, middle). (3) A line chart shows the course attendance of the last few years (Fig. 8, right). In addition, the dashboard contains statistics of the most frequently (4) previously, (5) simultaneously and (6) subsequently attended courses in the form of horizontal bar charts.

Figure 8: Extract from the dashboard for teachers.

5.3 Adaptive course recommendations for long study texts
In the third use case, adaptive navigation support in the sense of direct guidance [8] was integrated into the online learning environment Moodle. The Moodle standard page plugin (mod_page) has been enhanced for the readability of long texts [18], so that the course texts, some of which are over 60 DIN-A4 pages long, can also be used on screen.

The marginal columns of the text are used to point readers to chapters of other courses that are very similar to the currently displayed text paragraph. The recommendations are limited to two links per text paragraph. No recommendations are made for paragraphs of less than 100 words. The threshold value for the degree of similarity was chosen relatively high in order to avoid recommendations of courses that show only little similarity.
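The selection rule for these marginal-column links can be summarized in a few lines; the concrete threshold value and the data layout are assumptions, while the limits of at most two links per paragraph and at least 100 words per paragraph are taken from the text.

def marginal_links(paragraph_words, candidates, threshold=0.6, max_links=2, min_words=100):
    """candidates: list of (chapter_id, cosine_similarity) tuples for the current paragraph.
    Returns at most `max_links` chapters of other courses to display in the margin."""
    if len(paragraph_words) < min_words:    # no recommendations for short paragraphs
        return []
    similar = [c for c in candidates if c[1] >= threshold]   # relatively high threshold
    similar.sort(key=lambda c: c[1], reverse=True)
    return similar[:max_links]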
In terms of adaptive learning, it is taken into account whether the learner has already taken the recommended course. This information is analyzed in relation to the learning progress in the current Moodle course. In the case of lower progress, comparatively low quiz results and only a few points achieved in the assignments, we want to encourage the learner to make use of the previous knowledge he or she has acquired in earlier courses. Consequently, the recommended links point to courses that the learner already knows and which are semantically related to the currently displayed text paragraph. In the second case, high-performing students or those who have almost completed the current Moodle course are provided with links to courses they have not enrolled in so far. Often these are more advanced courses, if the students are at the beginning of their studies or if they have already enrolled in the basic courses. In this way, we would like to encourage students to deepen their knowledge in a specific area through targeted course recommendations.
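The adaptation rule described above can be paraphrased as follows; the progress and score thresholds as well as the learner-model fields are assumptions, only the distinction between repetition links (courses already taken) and deepening links (courses not yet taken) follows the text.

def adaptive_links(candidates, enrolled_courses, progress, quiz_score, assignment_score,
                   low=0.4, max_links=2):
    """candidates: (course_id, similarity) pairs for the current paragraph,
    pre-filtered as in the previous sketch; scores are normalized to 0..1."""
    struggling = progress < low and quiz_score < low and assignment_score < low
    if struggling:
        # encourage the use of previous knowledge: link to already known, related courses
        pool = [c for c in candidates if c[0] in enrolled_courses]
    else:
        # high-performing or almost finished learners: link to courses not yet taken
        pool = [c for c in candidates if c[0] not in enrolled_courses]
    return pool[:max_links]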
6. CONCLUSION AND OUTLOOK
An expandable corpus of the Faculty of Mathematics and Computer Science of the FernUniversität in Hagen was created. Special attention was paid to the fact that this corpus can be extended without manual effort. The corpus allows storage-efficient access to single course units or to several units per faculty, chair and course, so that it can serve as a basis for further studies. Subsequently, methods for feature extraction from the documents were investigated. The focus was on the mapping of semantics in the vector representation. For the selected PV model from [17] it was shown that PV can map semantic information even in texts with several thousand words. The results were evaluated with a gold standard and show a high correlation to it. In relation to comparable studies (e.g. [9]), this paper compared much larger texts with several thousand sentences instead of just individual sentences, which can be assigned semantically more precisely. By means of word and document embeddings, the similarity of two courses can be justified to the users of the system by considering the subordinate course units belonging to a course. In a next development step, a chapter-by-chapter or page-by-page analysis could make the relations of the units comprehensible by means of the relations of the chapters contained in the course. In order to improve the reliability of the evaluation, we have presented an approach to define a gold standard and two metrics (H1 and H2) for assessing STS for larger texts. However, the gold standard needs to be extended to allow better conclusions about the quality of the approach. There is also a need for other metrics that can be determined with less effort in order to assess the similarity of large texts.

In this article it was shown by way of example how STS can be examined for extensive textual learning resources of a distance-learning university. However, the methods are also transferable to traditional universities, which work more with presentation slides and online resources. Furthermore, it is conceivable to compare courses and study programs of different universities [7] and thus facilitate the choice of study places. From the administrative perspective of course planning and accreditation, further fields of application of the technology could arise. This only works as far as textual representations of learning materials such as presentation slides, video transcripts or online courses cover the content of a course.

The STS approach used here is subject to some limitations, which at the same time indicate a need for further research. In connection with document embeddings, intrinsic information on the content of the documents has not been considered so far (see [10] and LDA or LSA). Homonyms have not been considered either, but could be learned from labeled texts and applied to other texts. In order to be able to reproduce the learning materials of a course completely in the corpus, texts from diagrams and other visualizations should also be included. The possible applications shown in section 5 illustrate possible fields of application for the use of semantic relations of study texts, but require further investigation, especially user studies.

In all three use cases it becomes clear that the textual similarity of the learning materials alone is not sufficient to recommend courses, present comprehensive data for course authors or make meaningful recommendations in an adaptive learning environment. Apart from that, the identification of course duplicates and overlaps might be another interesting use case for the corpus of study materials. In order to enable further research of this kind, we are trying to publish the text corpus as research data.

7. REFERENCES
 [1] E. Agirre, M. Diab, D. Cer, and A. Gonzalez-Agirre. SemEval-2012 task 6: a pilot on semantic textual similarity, 2012.
 [2] S. B. Aher and L. Lobo. Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data. Knowledge-Based Systems, 51:1–14, 2013.
 [3] A. Askinadze and S. Conrad. Development of an Educational Dashboard for the Integration of German State Universities' Data. In Proceedings of the 11th International Conference on Educational Data Mining, pages 508–509, 2018.
 [4] A. Baumann, M. Endraß, and A. Alezard. Visual Analytics in der Studienverlaufsplanung. In Mensch & Computer, pages 467–469, 2015.
 [5] S. Bird, E. Klein, and E. Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, 2009.
 [6] M. Bostock, V. Ogievetsky, and J. Heer. D3 Data-Driven Documents. IEEE Trans. Vis. Comput. Graph., 17(12):2301–2309, 2011.
 [7] N. Brackhage, C. Schaarschmidt, E. Schön, and N. Seidel. ModuleBase: Inter-university database of study programme modules, 2016.
 [8] P. Brusilovsky. Adaptive Navigation Support. In The Adaptive Web: Methods and Strategies of Web Personalization, pages 263–290. 2007.
 [9] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation. arxiv.org, 2017.
[10] A. M. Dai, C. Olah, Q. V. Le, T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. CoRR, abs/1507.0, jul 2013.
[11] M. D'Aquin and N. Jay. Interpreting data mining results with linked data for learning analytics: motivation, case study and directions. In D. Suthers and K. Verbert, editors, Third Conference on Learning Analytics and Knowledge, LAK '13, Leuven, Belgium, April 8-12, 2013, pages 155–164. ACM, 2013.
[12] K. Gábor, H. Zargayouna, I. Tellier, D. Buscaldi, and T. Charnois. Exploring Vector Spaces for Semantic Relations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1814–1823, Stroudsburg, PA, USA, 2017. Association for Computational Linguistics.
[13] E. Grefenstette. Analysing Document Similarity Measures. PhD thesis, University of Oxford, 2009.
[14] Y. Hu, Y. Koren, and C. Volinsky. Collaborative Filtering for Implicit Feedback Datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263–272, dec 2008.
[15] A. A. Kardan, H. Sadeghi, S. S. Ghidary, and
     M. R. F. Sani. Prediction of student course selection
     in online higher education institutes using neural
     network. Computers & Education, 65:1–11, 2013.
[16] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel,
     A. Torralba, R. Urtasun, and S. Fidler. Skip-Thought
     Vectors. CoRR, abs/1506.0, 2015.
[17] Q. V. Le and T. Mikolov. Distributed Representations
     of Sentences and Documents. jmlr.org, 2014.
[18] Q. Li, M. R. Morris, A. Fourney, K. Larson, and
     K. Reinecke. The Impact of Web Browser Reader
     Views on Reading Speed and User Experience. In
     Proceedings of the 2019 CHI Conference on Human
     Factors in Computing Systems, CHI ’19, New York,
     NY, USA, 2019. Association for Computing
     Machinery.
[19] J. Lin, H. Pu, Y. Li, and J. Lian. Intelligent
     Recommendation System for Course Selection in
     Smart Education. Procedia Computer Science,
     129:449–453, 2018.
[20] B. Liu, T. Zhang, D. Niu, J. Lin, K. Lai, and Y. Xu.
     Matching Long Text Documents via Graph
     Convolutional Networks. CoRR, abs/1802.0, 2018.
[21] T. Mikolov, K. Chen, G. Corrado, and J. Dean.
     Efficient Estimation of Word Representations in
     Vector Space. 2013.
[22] I. Ognjanovic, D. Gasevic, and S. Dawson. Using
     institutional data to predict student course selections
     in higher education. The Internet and Higher
     Education, 29:49–62, 2016.
[23] J. Pennington, R. Socher, and C. D. Manning. GloVe:
     Global Vectors for Word Representation. In Empirical
     Methods in Natural Language Processing (EMNLP),
     pages 1532–1543, 2014.
[24] R. Rehurek and P. Sojka. Software Framework for
     Topic Modelling with Large Corpora. In Proceedings
     of the LREC 2010 Workshop on New Challenges for
     NLP Frameworks, pages 45–50, Valletta, Malta, 2010.
     ELRA.
[25] B. Schwendimann, M. Rodriguez-Triana, A. Vozniuk,
     L. Prieto, M. Boroujeni, A. Holzer, D. Gillet, and
     P. Dillenbourg. Perceiving learning at a glance: A
     systematic literature review of learning dashboard
     research. IEEE Transactions on Learning
     Technologies, PP(99):1, 2016.
[26] N. Spasojevic and G. Poncin. Large Scale Page-Based
     Book Similarity Clustering. In ICDAR 2011, 2011.
[27] S. M. Weiss, N. Indurkhya, and T. Zhang. Fundamentals of Predictive Text Mining. Texts in Computer Science. Springer London, London, 2015.
[28] L. Wittgenstein. Philosophical Investigations, page 272. The Macmillan Company / Blackwell, 1953.
[29] F. Zablith, M. Fernandez, and M. Rowe. Production and consumption of university Linked Data. Interactive Learning Environments, 23(1):55–78, 2015.
[30] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books. arXiv e-prints, page arXiv:1506.06724, jun 2015.