Online Courses Recommendation based on LDA

Rel Guzman Apaza, Elizabeth Vera Cervantes, Laura Cruz Quispe, José Ochoa Luna
National University of St. Agustin
Arequipa - Perú
{r.guzmanap,elizavvc,lvcruzq,eduardo.ol}@gmail.com

Abstract

In this paper we propose a course recommendation system based on the historical grades of students in college. Our model is able to recommend available courses on sites such as Coursera, Udacity, edX, etc. To do so, probabilistic topic models are used as follows. On the one hand, the Latent Dirichlet Allocation (LDA) topic model infers topics from the content given in a college course syllabus. On the other hand, topics are also extracted from massive open online course (MOOC) syllabi. These two sets of topics, together with grading information, are matched using a content-based recommendation system so as to recommend relevant online courses to students. Preliminary results show the suitability of our approach.

1 Introduction

Nowadays, the amount of educational resources spread over the Internet is huge and diverse (Martin, 2012). Massive Open Online Courses (MOOCs) such as Coursera, Udacity, and edX, to name a few, are gaining momentum (Fischer, 2014). It is possible to find courses from almost every knowledge domain. This vast offer overwhelms any user willing to find courses according to his/her background. The task can be tedious because it involves accessing each platform, searching the available courses, selecting some of them, reading each course syllabus carefully, and choosing the appropriate content. The process can become unmanageable if we extend our search beyond online courses to general educational content.

In this work we propose a system for online course recommendation, with MOOC courses as the primary focus. To do so, we rely on topic models (Blei, 2012), an unsupervised probabilistic generative model which, given a set of documents and a number of topics as input, automatically returns a relevant set of words probabilistically associated with each topic. Why is this scheme valuable? Consider, for instance, the huge number of digitalized books in a public library: this algorithm can automatically discover the main topic words and therefore allows one to gain insights into the content of the books.

Currently, educational systems and data mining form an emerging research area (Romero and Ventura, 2010). These systems use different recommendation techniques in order to suggest online learning activities, based on preferences, knowledge, and data from other students with similar interests (Romero et al., 2007). In (Kuang et al., 2011) the authors provide resource recommendation for users of an e-learning system based on content and user log activities. They proposed a method for resource recommendation based on topic modeling: the system used Latent Dirichlet Allocation (LDA) to obtain a low-dimensional vector for each resource, performed inference with Gibbs sampling, and then applied cosine similarity over document topic distributions to find neighboring resources. The authors of (Haruechaiyasak and Damrongrat, 2008) also recommended documents, in this case Wikipedia articles, by calculating similarity measures among the topic distributions of the articles. The model proposed in (Sadikov and Bratko, 2011) is a hybrid recommendation system whose core is a linear regression model based on stochastic gradient descent.
For predicting the rank of a lecture, they used and compared the predictions made by content-based and collaborative-based methods. In that paper the attributes that represent each video lecture were established manually, unlike our work, where the attributes of the courses are defined by the LDA algorithm. In (Sadikov and Bratko, 2011), to find a rank they measured the correlation between an old lecture (a lecture the visitor has already seen) and the new lectures (lectures the visitor has not seen yet), and then ordered these measures in a list, lowest first; these computations were used in the linear regression model. They also reported that there was not much difference between using content-based or collaborative-based methods, but noted that their system could have been improved by using textual attributes, which is our case.

In our proposal, the Latent Dirichlet Allocation (LDA) (Blei et al., 2003) topic model is mainly used as a feature descriptor of courses. Thus, we assume that each course has a set of inherent topics and therefore relevant words that summarize them. In our content-based recommendation setting, those are the input features that describe the courses. We are concerned with discovering the parameter vector of each user, i.e., weights over topic words that denote the user's preferences on courses. In order to infer this user vector, we rely on supervised machine learning: we take the grades obtained in college courses as ratings, learn the user weights, and predict ratings for unseen MOOC courses. Preliminary results show the suitability of this approach.

The paper is organized as follows. In Section 2 background is given. In Section 3 our proposal is presented. Section 4 shows experimental results. Finally, Section 5 concludes the paper.

2 Background

2.1 Probabilistic Topic Modeling

Topic models are probabilistic models that have mainly been used to discover topics in large collections of text documents. They are unsupervised learning techniques (Duda et al., 2012) that do not require any prior annotation or labeling of the documents: the topics emerge from the analysis of the original texts (Blei, 2012). To do so, they assume that each document is a combination of topics and that each topic is a probability distribution over words (Blei et al., 2003). Topic models are a type of graphical model based on Bayesian networks.

The generative process described by a topic model does not make any assumptions about the order of words as they appear in documents. The only information relevant to the model is the number of times words are produced; this is known as the "bag-of-words" assumption (Steyvers and Griffiths, 2007).

There are two main topic models: LDA (Blei et al., 2003) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann, 1999). In this work we use LDA due to its more general model. It is also worth noting that LDA has previously been used in recommendation systems (Romero and Ventura, 2010; Romero et al., 2007; Kuang et al., 2011).

2.2 Topic Modeling using Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) (Blei et al., 2003) is widely used for identifying the topics in a set of documents, building on previous work by Hofmann (Hofmann, 1999). The corresponding graphical model representation is depicted in Figure 1, where each document is represented as a mixture of a fixed number of topics, with topic z receiving weight θ_z^(d) in document d, and each topic is a probability distribution over a finite vocabulary of words, with word i having probability φ_i^(z) in topic z.

Figure 1: Graphical model for topic modeling, using plate notation.

Symmetric Dirichlet priors are placed on θ^(d) and φ^(j), with θ^(d) ∼ Dirichlet(α) and φ^(j) ∼ Dirichlet(β), where α and β are hyper-parameters that affect the sparsity of these distributions. The hyper-parameter α can be interpreted as a prior observation count for the number of times a topic is sampled in a document, and β as the prior observation count of the number of times words are sampled from a topic before any word from the corpus is observed. This smooths the word distribution in every topic, with the amount of smoothing determined by β. The goal of inference in this model is to identify the values of φ and θ, given a corpus of D documents represented by a vocabulary of W words.

In our proposal, each course is a document d with an associated sequence of N_d word tokens, with N words in the overall corpus. From this probability distribution it is possible to perform inference in order to compute the conditional probability of the topic structure given the observed document. The probability distribution of topics in a document represents a feature vector for that document.
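To make the role of LDA as a feature extractor concrete, the following minimal sketch shows how a per-course topic distribution could be obtained. It is an illustration only: the toy course texts, the use of the gensim library, and the parameter values are assumptions rather than the implementation used in this paper, and gensim's LdaModel performs variational inference rather than the Gibbs sampling described in the next subsection.

# Illustrative sketch (not this paper's implementation): per-course topic
# distributions with LDA, using gensim and invented course descriptions.
from gensim import corpora
from gensim.models import LdaModel

course_texts = [
    "calculus limits derivatives integrals functions",          # hypothetical syllabi
    "machine learning regression classification probability",
    "databases relational model sql queries transactions",
]
tokenized = [text.split() for text in course_texts]

dictionary = corpora.Dictionary(tokenized)                       # vocabulary of W words
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]      # bag-of-words counts

T = 2                                                            # number of topics (input)
lda = LdaModel(bow_corpus, num_topics=T, id2word=dictionary,
               alpha=50.0 / T, eta=0.01, passes=50)              # symmetric Dirichlet priors

# theta_d: the topic mixture of each course, later used as its feature vector
for d, bow in enumerate(bow_corpus):
    theta_d = lda.get_document_topics(bow, minimum_probability=0.0)
    print(d, [(z, round(p, 3)) for z, p in theta_d])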
2.3 Gibbs Sampling Algorithm

Several algorithms have been proposed in the literature to obtain the main variables of interest, θ and φ. (Hofmann, 1999) used the expectation-maximization (EM) algorithm, but this approach suffers from problems involving local maxima of the likelihood function, which has motivated the search for better estimation algorithms such as those proposed in (Blei et al., 2003; Buntine, 2002; Minka and Lafferty, 2002).

Instead of directly estimating the variables for each document, another approach is the algorithm called "Gibbs sampling" (Griffiths and Steyvers, 2004), which provides a relatively efficient method for extracting a set of topics from a large corpus. Gibbs sampling considers each word token in the text collection in turn, and estimates the probability of assigning the current word token to each topic, conditioned on the topic assignments of all other word tokens. From this conditional distribution, given a document, a topic is sampled and stored as the new topic assignment for this word token. We write this conditional distribution as:

P(z_j \mid z_{N \setminus j}, w_N) = \frac{n_{z_j, N \setminus j}^{(w_j)} + \beta}{n_{z_j, N \setminus j}^{(\cdot)} + W\beta} \cdot \frac{n_{z_j, N \setminus j}^{(d_j)} + \alpha}{n_{\cdot, N \setminus j}^{(d_j)} + T\alpha}

where:

w_N = (w_1, ..., w_N) are the words in the entire corpus
z_N = (z_1, ..., z_N) are the topic assignments of the words
z_{N\j} indicates (z_1, ..., z_{j-1}, z_{j+1}, ..., z_N)
W is the size of the vocabulary
T is the number of topics
n_{z_j,N\j}^{(w_j)} is the number of times the word w_j is assigned to topic z_j
n_{z_j,N\j}^{(·)} is the total number of words assigned to topic z_j
n_{z_j,N\j}^{(d_j)} is the number of times a word in document d_j is assigned to topic z_j
n_{·,N\j}^{(d_j)} is the total number of words in document d_j
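For readers who prefer to see the update above in procedural form, here is a minimal sketch of one sweep of collapsed Gibbs sampling. It assumes count arrays already consistent with the current topic assignments z; the variable names and structure are ours, not taken from this paper.

# Sketch: one sweep of collapsed Gibbs sampling for LDA, following the
# conditional distribution above (illustrative only).
import numpy as np

def gibbs_sweep(docs, z, n_wt, n_t, n_dt, n_d, alpha, beta):
    """docs: list of word-id lists; z: topic assignments with the same shape.
    n_wt: (W, T) word-topic counts, n_t: (T,) topic totals,
    n_dt: (D, T) document-topic counts, n_d: (D,) document lengths."""
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for j, w in enumerate(doc):
            t_old = z[d][j]
            # remove the current token from all counts (the "\ j" in the formula)
            n_wt[w, t_old] -= 1; n_t[t_old] -= 1
            n_dt[d, t_old] -= 1; n_d[d] -= 1
            # P(z_j = t | z_{\ j}, w) up to a normalizing constant
            p = ((n_wt[w, :] + beta) / (n_t + W * beta)
                 * (n_dt[d, :] + alpha) / (n_d[d] + T * alpha))
            t_new = np.random.choice(T, p=p / p.sum())
            # store the sampled topic and restore the counts
            z[d][j] = t_new
            n_wt[w, t_new] += 1; n_t[t_new] += 1
            n_dt[d, t_new] += 1; n_d[d] += 1
    return z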
2.4 Recommender Systems

According to (Ricci et al., 2011), recommender systems are software tools and techniques that provide item suggestions to a user. The suggestions are aimed at supporting users in various decision-making processes, such as which items to buy, which music to listen to, or which news to read.

As a rule, in a recommendation-system application there are two classes of entities, which we shall refer to as users and items. Users have preferences for certain items, and these preferences must be teased out of the data (Rajaraman and Ullman, 2012). The data itself is represented as a utility matrix, giving for each user-item pair a value that represents what is known about the degree of preference of that user for that item. Values come from an ordered set, e.g., integers 1-5 representing the number of stars that the user gave as a rating for that item. We assume that the matrix is sparse, meaning that most entries are unknown. An unknown rating implies that we have no explicit information about the user's preference for the item. The goal of a recommendation system is to predict the blanks in the utility matrix.

There are two basic architectures for a recommendation system (Rajaraman and Ullman, 2012):

• Content-based systems focus on the properties of items. The similarity of items is determined by measuring the similarity of their properties.

• Collaborative-filtering systems focus on the relationship between users and items. The similarity of items is determined by the similarity of the ratings given to those items by the users who have rated both of them.

In a content-based system, we must construct a profile for each item, which is a record or collection of records representing important characteristics of that item. In simple cases, the profile consists of some characteristics of the item that are easily discovered. For example, for a movie these would be the set of actors, the director, and the genre or general type of movie. For documents it is not immediately apparent what the values of the features should be. There are many kinds of documents for which a recommendation system can be useful; for example, there are many news articles published each day, and we cannot read all of them, so a recommendation system can suggest articles on topics a user is interested in. Unfortunately, documents do not tend to come with readily available features. A substitute that has been useful in practice is the identification of words that characterize the topic of a document. One approach is to compute the TF-IDF (term frequency - inverse document frequency) score for the words in a document; the words with the highest scores are the ones that characterize the document, and in this sense documents are represented by sets of words. In this paper we have used a different approach, which relies on finding document topic information by using topic modeling algorithms such as LDA.
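As a point of reference for the TF-IDF item profiles mentioned above (the profiles that this paper replaces with LDA topic distributions), a minimal sketch could look as follows; the choice of scikit-learn and the toy documents are assumptions for illustration.

# Sketch of the classic TF-IDF item profile (scikit-learn assumed, toy documents).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["calculus limits derivatives integrals",
        "neural networks learning gradient descent",
        "relational databases sql queries"]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)                   # one row per document
terms = vectorizer.get_feature_names_out()

top = tfidf[0].toarray().ravel().argsort()[::-1][:3]     # highest-scoring words of doc 0
print([terms[i] for i in top])                           # the words that characterize it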
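The utility matrix itself can be pictured as a small table with missing entries. The sketch below (invented course names and grades, pandas assumed) shows the blanks the recommender has to fill in, in the spirit of Table 1 in the next section.

# Sketch: a tiny utility matrix; NaN marks the unknown ratings to be predicted.
import numpy as np
import pandas as pd

utility = pd.DataFrame(
    {"user_1": [12.0, 14.0, np.nan],
     "user_2": [15.0, np.nan, np.nan]},
    index=["Calculus", "Databases", "ML (MOOC)"])   # college rows carry grades,
print(utility)                                      # MOOC rows start empty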
3 Proposal

In order to recommend online courses, each course is considered a document with a given content. To characterize each course, LDA is used to uncover the semantic structure hidden in the document. Since LDA gives us a topic distribution for each course, this output is used as a feature vector for the courses (the items, in a content-based recommendation setting). A recommendation system is built using item profiles and utility matrices, and we treat the problem as one of machine learning: we regard the given data as a training set and, for each user, build a classifier that predicts the rating of all items. The rest of this section describes our main design choices.

Consider the utility matrix in Table 1, used to represent a content-based recommendation system. The first column contains course names (college and MOOC courses). The second column contains the feature descriptors of the courses; each row denotes a different course, therefore each course has a different feature vector. The third column shows the user vector profile Θ^(1) for user 1. This vector could comprise user 1's preferences about art, math, biology, and social sciences in general. The same column also shows user 1's ratings for each course (they are in fact the grades obtained in college by user 1; see for instance the rating 12 for Calculus). Further columns for user 2, user 3, and so on should be added accordingly. Our goal is to predict the missing ratings for the MOOC courses (the "?" symbol in the last row) for user 1 (and user 2, 3, etc.).

Table 1: Utility matrix for courses

  course     | features          | profile user(1) — rating
  Calculus   | x_1, ..., x_n     | θ_1^(1), ..., θ_n^(1) — 12
  ...        | ...               | ...
  ML (MOOC)  | x'_1, ..., x'_n   | θ_1^(1), ..., θ_n^(1) — ?

In order to do so, we perform the following steps (a brief code sketch of the learning and prediction steps follows the list):

• Extract item vectors for the courses: item vectors are defined by the course content, i.e., the text that describes the course, such as the "About the Course" information. In order to construct the item vectors (features from documents), we rely on the Latent Dirichlet Allocation algorithm, which extracts topic information from text as a probability distribution over words. Since we use a machine learning setting, the item vectors are the features of a regression/classification problem, which we denote X = {X_1, X_2, ..., X_n}.

• Learn the user's vector: a user's interests in course topics can be modeled by a user vector, which must be learned for each user. To do so, we use a machine learning approach: all available ratings (grading information in college) are used to train a multilinear regression model (Bishop and others, 2006). The user's vector is therefore the resulting set of parameters (or weights), Θ^(1) = {θ_1^(1), ..., θ_n^(1)}, learned from the training data (for instance, all courses and grades of user 1). There are m (number of users) sets of parameters. In a multilinear regression algorithm we want to find the values of Θ that minimize the cost function:

  J(\Theta_0, \Theta_1, ..., \Theta_n) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right)^2

  We define the hypothesis:

  h_\Theta(x) = \Theta^T x = \Theta_0 x_0 + \Theta_1 x_1 + \Theta_2 x_2 + ... + \Theta_n x_n

  where Θ_0, Θ_1, ..., Θ_n are the parameters we want to estimate by minimizing the cost function. One way to minimize the cost function is the gradient descent method, where each iteration of gradient descent brings the parameters θ_j closer to the optimal values that minimize the cost function J(θ). For n ≥ 1, repeat until convergence:

  \Theta_j := \Theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\Theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

  (simultaneously updating Θ_j for j = 0, ..., n).

• Given the item and user vectors, the goal is to predict the rating R_C of a MOOC course C with feature vector X_C for user U, i.e., with user vector profile Θ^(U). The resulting predicted rating is given by:

  R_C = X_C^T \Theta^{(U)}
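As announced before the list, the learning and prediction steps can be sketched in a few lines: batch gradient descent on the cost J(Θ), followed by the dot-product prediction R_C = X_C^T Θ^(U). This is a minimal sketch under our own assumptions (invented topic vectors, ratings, and learning rate), not the exact implementation used in the experiments.

# Sketch: learn a user's parameter vector by gradient descent, then predict
# the rating of an unseen MOOC course (all numbers are invented examples).
import numpy as np

def learn_user_vector(X, y, lr=0.1, iters=2000):
    """X: topic feature vectors of the user's rated courses, y: the ratings."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # add x_0 = 1 for the bias term Theta_0
    theta = np.zeros(n + 1)
    for _ in range(iters):
        grad = Xb.T @ (Xb @ theta - y) / m      # gradient of the cost J(Theta)
        theta -= lr * grad                      # simultaneous update of every Theta_j
    return theta

def predict_rating(x_course, theta):
    return np.r_[1.0, x_course] @ theta         # R_C = X_C^T Theta (with bias term)

# usage: three rated college courses described by 4-topic LDA vectors
X_train = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.2, 0.6, 0.1, 0.1],
                    [0.1, 0.1, 0.7, 0.1]])
ratings = np.array([5.0, 12.0, 17.0])           # inverted grades used as ratings
theta_user = learn_user_vector(X_train, ratings)
print(predict_rating(np.array([0.6, 0.2, 0.1, 0.1]), theta_user))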
An overview of the recommendation system is depicted in Figure 2. After estimating the ratings for a student, we recommend courses using a "top-10 best recommendations" approach; thus, each student always gets 10 recommended courses. These are the MOOCs most related to the courses in which the student got the 10 lowest grades.

Figure 2: Block diagram for the recommendation system.

4 Experimental Results

This section shows preliminary experimental results conducted on real-world data sets. Courses and user grading information were extracted from a Peruvian university. Some MOOC courses were extracted from Coursera, where the following categories were considered: "business and management", "computer science - artificial intelligence", "computer science - software engineering", "computer science - systems and security", "computer science - theory", "mathematics", and "statistics and data analysis". The most significant information of each course is given by the "Introduction", "About the Course", and "FAQ" sections.

All extracted information has been preprocessed as follows: remove non-ASCII characters, strip HTML tags, remove special strings, and remove multiple spaces and blank lines (code sketches of these steps and of the rating construction appear at the end of this section). After that we built the corpus that is further used by the LDA algorithm. The number of Coursera courses considered was 69, while the number of college courses was 43, which gives rise to 112 courses in total. The topic modeling algorithm used the Gibbs sampling inference procedure and, following (Blei, 2012), we set the parameters α = 50/T and β = 0.01. The number of iterations was chosen to be large enough to guarantee convergence, N = 200.

To measure performance, accuracy was computed by counting the number of correct matches between college courses and Coursera courses. Figure 3 illustrates the impact of the number of topics T on the topic model. Higher accuracy is achieved when we use a larger number of topics, so we set the number of topics equal to the number of Coursera courses.

Figure 3: Accuracy of the recommendation system according to the number of topics. Better precision, and also better efficiency, is obtained when the number of topics equals the number of Coursera courses, T = 69.

The goal of our proposal is to recommend courses to students who have received low grades in college; therefore, we use grades as ratings. To keep a standard recommendation-system setting, we invert the grading information, so that a grade of 20 becomes a rating of 0 and vice versa (this step might not be necessary in other recommendation systems). Mean normalization is also used to obtain more reliable recommendations for students with few grades available, for instance, first-year students.

For testing, we define a variable "top-N" which denotes the number of courses to recommend. For instance, for student "a" we recommend the top-N Coursera courses for which he/she has obtained the greatest ratings. In Figure 4, the x-axis denotes several values of top-N and the y-axis denotes the accuracy obtained. An accuracy over 0.6 is achieved for top-N greater than or equal to 10.

Figure 4: Recommendation system accuracy according to the number of top recommended courses (top-N).

In Figure 5, a comparison between the ratings of Coursera courses and college courses for one student is shown. We intend to show the proximity between the predicted data (ratings on Coursera courses) and the provided data (ratings on college courses).

Figure 5: Comparison chart of ratings for one student (predicted Coursera ratings vs. real university ratings).
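A minimal sketch of the preprocessing pipeline described above might look as follows; the concrete regular expressions and the example input are our own choices, not the scripts used in this work.

# Sketch of the preprocessing steps: drop non-ASCII characters, strip HTML
# tags, remove special strings (here, HTML entities), collapse spaces and blank lines.
import re

def preprocess(raw):
    text = raw.encode("ascii", "ignore").decode("ascii")   # remove non-ASCII characters
    text = re.sub(r"<[^>]+>", " ", text)                    # strip HTML tags
    text = re.sub(r"&[a-zA-Z]+;", " ", text)                # remove special strings
    text = re.sub(r"[ \t]+", " ", text)                     # collapse multiple spaces
    text = re.sub(r"\n\s*\n+", "\n", text)                  # drop blank lines
    return text.strip()

print(preprocess("<p>Machine&nbsp;Learning: r\u00e9gression &amp; classification</p>\n\n"))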
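The rating construction and the top-N selection can be sketched just as briefly. The 0-20 grade scale, the inversion, and the mean normalization follow the description above; the course names and predicted values are invented.

# Sketch: invert 0-20 grades into ratings, mean-normalize them, and pick the
# top-N MOOCs with the highest predicted ratings (toy values throughout).
import numpy as np

grades = np.array([15.0, 8.0, 18.0, 11.0])      # one student's grades on a 0-20 scale
ratings = 20.0 - grades                          # a grade of 20 becomes a rating of 0
ratings = ratings - ratings.mean()               # mean normalization per student

predicted = {"Calculus One": 4.2, "Machine Learning": 6.8,
             "Data Analysis": 5.1, "Algorithms": 3.9}
N = 2
top_n = sorted(predicted, key=predicted.get, reverse=True)[:N]
print(top_n)                                     # the N courses recommended to the student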
5 Conclusion

We have introduced a novel approach for recommending online courses that combines the LDA probabilistic topic model with content-based recommendation. In short, we use a machine learning approach in which LDA allows us to extract feature descriptors from courses, and rating prediction is performed by inferring user profile parameters with multilinear regression. Preliminary experimental results show that our algorithm performs well when compared to a similar approach based on cosine similarity with LDA.

Although we have focused on MOOCs as the source of recommendation content, nothing prevents us from using this approach beyond that domain. In fact, further domains can be included by performing feature topic extraction on them. Future work will address scalability issues. In this sense, topic models such as LDA have scalable versions available; for instance, a MapReduce implementation is provided in the Apache Mahout library (https://mahout.apache.org/). There are also scalable versions of multilinear regression.

References

Christopher M. Bishop et al. 2006. Pattern Recognition and Machine Learning, volume 1. Springer, New York.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, March.

David M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4):77–84.

Wray Buntine. 2002. Variational extensions to EM and multinomial PCA. In Machine Learning: ECML 2002, pages 23–34. Springer.

Richard O. Duda, Peter E. Hart, and David G. Stork. 2012. Pattern Classification. John Wiley & Sons.

Gerhard Fischer. 2014. Beyond hype and underestimation: identifying research challenges for the future of MOOCs. Distance Education, 35(2):149–158.

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228–5235.

Choochart Haruechaiyasak and Chaianun Damrongrat. 2008. Article recommendation based on a topic model for Wikipedia Selection for Schools. 5362:339–342.

Thomas Hofmann. 1999. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.

Wei Kuang, Nianlong Luo, and Zilei Sun. 2011. Resource recommendation based on topic model for educational system. 2:370–374, August.

Fred G. Martin. 2012. Will massive open online courses change how we teach? Communications of the ACM, 55(8):26–28, August.

Thomas Minka and John Lafferty. 2002. Expectation-propagation for the generative aspect model. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 352–359. Morgan Kaufmann Publishers Inc.

Anand Rajaraman and Jeffrey David Ullman. 2012. Mining of Massive Datasets. Cambridge University Press, Cambridge.

Francesco Ricci, Lior Rokach, and Bracha Shapira. 2011. Introduction to Recommender Systems Handbook. Springer.

C. Romero and S. Ventura. 2010. Educational data mining: A review of the state of the art. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 40(6):601–618, November.

Cristóbal Romero, Sebastián Ventura, José Antonio Delgado, and Paul De Bra. 2007. Personalized links recommendation based on data mining in adaptive educational hypermedia systems. 4753:292–306.

Er Sadikov and Ivan Bratko. 2011. Recommending videolectures with linear regression.

Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.