Constructing a Dynamic Bayes Net Model of Academic Advising

Joshua T. Guerin and Judy Goldsmith*
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046
jtguer2@uky.edu, goldsmit@cs.uky.edu

* This material is based upon work supported by the National Science Foundation under Grant No. 1049360.

Abstract

In this paper we apply ideas from collaborative filtering to the problem of building dynamic Bayesian network (DBN) models for planning. We demonstrate that item-based collaborative filtering can be used to construct dynamic Bayesian networks for use in large, factored domains with sparse data. Such Bayesian networks can model the transition function for decision-theoretic planning. We demonstrate the feasibility and effectiveness of this technique on an academic advising domain, based on student grades in computer science and related courses at the University of Kentucky.

1 Introduction

In this paper we examine the use of memory-based collaborative filtering (CF) algorithms for constructing static models of data. This work is grounded in the real-world domain of academic advising: we use an item-based collaborative filtering algorithm to generate dynamic Bayesian network models of an advising domain from sparse grade data.

Collaborative filtering algorithms are designed to aggregate the opinions or preferences of a large number of users in order to extrapolate information about unobserved preferences for new and existing users. Recommendation systems are constructed using CF techniques to locate items in a database which a target user is likely to prefer. Preferences are typically defined by grades that the user provides either explicitly (by grading items that they have already seen) or implicitly (often indicated by patterns of behavior, such as browsing habits). These grades can be represented in a number of ways, but are often numerical in nature; most recommender systems ask for a numerical grade (1–5) or a grade based on letters or "stars" which is easily mapped to a numerical grade (for instance 1–5 stars, or a letter grade of A–E).

CF algorithms can be roughly divided into model-based and memory-based algorithms. Model-based algorithms involve generating a predictive model from the data and using it to make preference-related predictions. One formalism that has seen success in model-based CF is the Bayesian network [1].

Memory-based CF algorithms operate over the database of items to make predictions, leveraging a measure of similarity between users or (more commonly) between items to determine grades for unseen items. This class of algorithms provides several features which are useful for making predictions. Notably, these algorithms are designed to operate over very large datasets (common examples include the Netflix dataset, the MovieLens datasets, and the Amazon.com recommendation system). Such datasets typically contain tens of thousands of items and grades from hundreds of thousands of users; however, since most users only provide grades for a small percentage of items, these datasets are very sparse. Because of this, modern recommendation systems must scale well and must work well with very sparse data.

2 A Predictive Model for Academic Advising

Reasoning in the domain of undergraduate academic advising is often approached as a deterministic process: short- and long-term decision making is based on the assumption that a student's actions (i.e., taking one or more courses) will succeed. This doesn't capture the nuances and complexity of the real world. The outcome of taking a course cannot always be predicted with certainty; even a student who makes consistent A's may perform poorly at some point.

Given the stochastic nature of grade prediction, it may be desirable to construct statistics-based models of student performance from real-world data.
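To make the sparse grade data concrete, here is a minimal sketch of one possible transcript representation. This is our own illustration, not the paper's code: the dict-of-dicts layout, the course codes, the student IDs, and the `grades_for` helper are all hypothetical, with letter grades mapped to integers (A=1 through E=5) as described above.

```python
# Hypothetical sketch: sparse transcript data as a dict of {student: {course: grade}}.
# Course codes and students are invented for illustration.

GRADE_SCALE = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

# Most courses are absent from each transcript, which is what makes the data sparse.
transcripts = {
    "s001": {"CS115": GRADE_SCALE["A"], "CS215": GRADE_SCALE["B"]},
    "s002": {"CS115": GRADE_SCALE["C"]},
    "s003": {"CS215": GRADE_SCALE["A"], "MA113": GRADE_SCALE["B"]},
}

def grades_for(course):
    """Collect the grades all students earned in one course."""
    return {s: t[course] for s, t in transcripts.items() if course in t}
```

A layout like this makes the sparsity visible: each student's record carries only the handful of courses actually taken, out of hundreds offered.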
Students leave behind tangible evidence of progress in the form of transcript data. Universities amass a wealth of data with which to make predictions about grades, and from this we can construct probabilistic predictive models.

The problem of grade prediction closely resembles the rating prediction problem in collaborative filtering: make predictions about a student's grades in untaken courses, given their past grades and the transcript data from many past students. Letter grades can map directly to integers, where A=1 and Failure=5.

The Dynamic Bayesian Network (DBN) formalism has a number of features which make it ideal for this sort of modeling. A DBN model consists of a directed acyclic graph with links representing temporal, probabilistic relationships between variables, and conditional probability tables (CPTs) that specify those relationships quantitatively [2] (a discussion of DBNs follows in Section 3.2).

We are interested in a class of DBNs which model only a single time-step, known as 2-slice DBNs. This imposes restrictions on the underlying graphical structure: specifically, variable values at one time-step are conditioned only on the values of parent variables at the previous time-step. These structural restrictions make 2-slice DBNs a potentially compact representation for decision-theoretic planning, and for this reason we limit our attention to them.

In this paper we present a simple collaborative filtering algorithm, and demonstrate how it is used to generate a valid DBN model of state transitions in the advising domain. We use real-world data from the Computer Science Department at our university as a testbed for our model generation techniques.
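The 2-slice restriction described above can be sketched as a data structure. This is an assumed representation for illustration, not the authors' implementation; the class name and method signatures are ours, and the example uses the course variables from the paper's advising example (LTC conditioned on INTR, DISC, and INUM).

```python
# Minimal sketch of a 2-slice DBN structure: each time-(t+1) variable lists its
# time-t parents, and its CPT maps an assignment of parent grades to a
# distribution over its own possible grades. Assumed representation.

class TwoSliceDBN:
    def __init__(self):
        self.parents = {}   # child course -> list of parent courses (time t)
        self.cpt = {}       # child course -> {parent-grade tuple: distribution}

    def add_course(self, child, parent_courses):
        self.parents[child] = list(parent_courses)
        self.cpt[child] = {}

    def set_distribution(self, child, parent_grades, dist):
        # parent_grades is a tuple aligned with self.parents[child]
        self.cpt[child][tuple(parent_grades)] = dist

    def predict(self, child, parent_grades):
        return self.cpt[child][tuple(parent_grades)]

# Example mirroring the paper's advising domain: the grade in LTC at time t+1
# is conditioned on the time-t grades in INTR, DISC, and INUM.
dbn = TwoSliceDBN()
dbn.add_course("LTC", ["INTR", "DISC", "INUM"])
dbn.set_distribution("LTC", (1, 2, 1), {1: 0.6, 2: 0.3, 3: 0.1})
```

Note the Markov flavor of this structure: a prediction for a child at time t+1 looks only at the parent assignment from time t, never further back.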
3 Background

3.1 Bayesian Networks

A Bayesian network is a directed acyclic graph G = ⟨V, E⟩, where each vertex v ∈ V is a variable with domain dom(v). Each v ∈ V has an associated probability distribution over values in dom(v), conditioned on the values of Pa_v ⊂ V, the parents of v. These conditional probability distributions are usually enumerated in tabular form as conditional probability tables (CPTs), one for each variable.

Learning of Bayesian networks is often divided into structure learning and parameter learning. Structure learning is the problem of learning the graphical structure E by discovering predictive or causal dependencies between variables. Parameter learning is the problem of learning the conditional probability distributions for a given network structure.

Because the space of all possible networks is very large, structure learning is usually approached as a heuristic search problem, or as an exact search of a constrained version of the search space (see [4, 6, 11] for examples). Search for an optimal (or near-optimal) network structure is guided by a scoring function (one example is the log-likelihood scoring function).

Once structure is known, CPT parameters (probability distributions over outcomes) are generally learned from the data. Examples of parameter learning for DBNs include maximum likelihood estimation (for instance, via the expectation maximization [3] algorithm) and Bayesian estimation.

Unlike most Bayesian network learning algorithms, ours is validated on the quality of its predictions rather than of its inference. In other words, our work looks forward in time rather than backward. We conjecture that good learned probabilistic planning models may actually differ from probabilistic inference models learned from the same data.

Rather than selecting a single ideal structural size, we make structure size a parameter of our algorithm. Since we are considering models for the purpose of planning, we must consider the tradeoff between accuracy of the representation and tractability of planning. Our goal is to be able to generate DBNs of different sizes for different purposes. We examine how our algorithm fares as a function of structure size in Section 5. At this point we are left with the question of how to select n parent nodes for each node.

Goldenberg et al. approached a similar problem of learning Bayesian network structures from sparse data using frequent set mining [5]. Frequent sets are widely used in data mining for learning common co-occurrence between sets of items. The idea of applying frequent set mining to academic advising may be useful in other capacities (learning combinations of courses which should or should not be taken together); however, co-occurrence of actions is less applicable to building predictive models of advising, because courses which are frequently taken together are unlikely to make good predictors for each other. Parent courses should be taken before child courses, otherwise they provide little information.

Rather than using co-occurrence, we make the assumption that similar variables make better predictors than dissimilar variables, and we examine the use of pairwise item similarity in selecting parent nodes. Item similarity is commonly used in collaborative filtering and other data mining applications to determine which items hold the most predictive power for a target item, allowing for better predictions to be made.

3.2 2-Slice Dynamic Bayesian Networks

Bayesian networks have been demonstrated to be useful for inference in a number of domains; however, the standard framework does not have an explicit notion of time. A dynamic Bayesian network builds upon the Bayesian network idea, incorporating temporal or sequential aspects of the data into its structure. Variables at one time-step may influence the value of variables at future time-steps (or at the same time-step).

We are interested in a special case of dynamic Bayesian networks, the 2-slice dynamic Bayesian network. A 2-slice dynamic Bayesian network is a Bayesian network with vertex set V ∪ V′, where V represents the variables at time t and V′ the variables at time t+1, with edges from V to V′ (and sometimes between vertices in V′). In DBNs of this form, V and V′ may be visualized as two separate columns representing, respectively, the variables at time t and t+1.

This structural formulation implies two theoretical assumptions under which we operate: a stationary assumption, where models are not time-dependent, and a Markov assumption, where there is no memory of past states; future values are conditioned only on the current system state.

Figure 1 gives the structure of an example 2-slice DBN which could be used for planning in an academic domain. This DBN structure shows that the expected grade in Logic and Theory of Computing (LTC) is conditioned on the grades obtained in Introduction to Programming (INTR), Discrete Mathematics (DISC), and Introduction to Numerical Methods (INUM).

[Figure 1: An example DBN structure, showing time-t nodes INTR, DISC, INUM, LTC and their time-(t+1) counterparts INTR′, DISC′, INUM′, LTC′.]

In the case of discrete-valued variables, each child node in the DBN has an associated conditional probability table (CPT) which gives a probability distribution over possible values for every possible assignment to its parent variables (incoming edges in the graph) at the previous time-step. Because all possible assignments to parent variables may need to be enumerated explicitly, CPT size is exponential in the number of parent variables. For example, a CPT for a single course with 5 parents, each of which has 6 possible values (A–D, Failure, and NOT TAKEN), will have 6^5 = 7,776 rows, each containing a probability distribution over the 6 possible outcomes.

For modern computers, tables of this size are unlikely to cause representational issues. However, the need for enough data to populate such a table's 6^6 = 46,656 probabilities makes seemingly abundant data rather sparse. Popular or required courses may be taken by hundreds or even thousands of students within the span of several years, but even this is insufficient to derive realistic probability distributions from straight statistical analysis. The problem is worse for most other courses (and for smaller colleges and universities), where enrollment over several years may reach only hundreds of students or fewer.

In order to deal with the problem of prediction when data is sparse, we turn to techniques from collaborative filtering to aggregate the data that is available. Collaborative filtering algorithms are commonly used to narrow down choices based on a user's preferences and the preferences of current and past users. A common example application is predicting preferences over unseen items (movies, music, groceries) based on grades given for other items [1].

3.3 Collaborative Filtering

Collaborative filtering recommendation algorithms typically fall into one of two general categories: model-based algorithms and memory-based algorithms [1]. Model-based algorithms involve generating a model based on data, and using the model to make predictions. We are interested in memory-based algorithms, which use the entire data set to make predictions. This class of algorithms is described in Section 3.4.

Collaborative filtering algorithms also rely heavily on the notion of similarity: similar users are likely to assign similar grades to items, and likewise, similar items may be given similar grades. Collaborative filtering systems often employ one of these two assumptions, known respectively as user-based and item-based collaborative filtering. In this paper we focus on item-based collaborative filtering because of the performance demonstrated by these algorithms and because of their user-independent nature.

One of the most common approaches in collaborative filtering is to use the database of user grades to determine item-item similarity. For each pair of items in the database, a vector of grades is created, retaining only grades from users who voted for both items [7]. To these vectors a number of distance metrics can be applied; in our implementation we tested two common vector similarity metrics, Pearson's correlation coefficient and cosine similarity.
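The item-item similarity computation described above can be sketched as follows. This is an illustrative implementation under assumed data shapes (the dict-of-dicts transcript layout and function names are ours): for a pair of items we keep only co-rated grades, then apply cosine similarity or Pearson's correlation to the resulting vectors.

```python
# Sketch of item-item similarity from co-rated grade vectors. Assumed layout:
# data is {student: {course: grade}}; only students who graded both courses
# contribute to the pair's vectors.
from math import sqrt

def corated(data, a, b):
    """Return the aligned grade vectors for items a and b over co-raters."""
    pairs = [(t[a], t[b]) for t in data.values() if a in t and b in t]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def cosine_sim(x, y):
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return num / den if den else 0.0

def pearson_sim(x, y):
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

Restricting each pair to co-raters is what makes these metrics usable on sparse data: the vectors compared are short but fully populated.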
3.4 Item-Based Collaborative Filtering

The collaborative filtering algorithm that we use in this paper is an item-based algorithm presented by Sarwar et al. [7]. First, item-item similarity is calculated over all items in the database; for item-item similarity we use Pearson's correlation coefficient and cosine similarity. For a user u and an item i, predictions are made using the weighted sum of u's grades for all items which are similar to i. This can be expressed as:

    p_{u,i} = ( Σ_N s_{i,N} · R_{u,N} ) / ( Σ_N |s_{i,N}| ),    (1)

where the sums range over all items N similar to i. Here, p_{u,i} is the predicted grade that user u might give item i, s_{i,N} is the similarity between items i and N, and R_{u,N} is the grade that u provided for item N.

Equation 1 produces a single, most likely grade for the given user and item. Because a DBN requires a probability distribution over all possible grades, we are not yet ready to encode our DBNs.
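Equation 1 can be transcribed directly. The sketch below assumes precomputed similarities keyed by item pair and a per-user grade dict; the helper name, the `k` nearest-neighbor cutoff, and the data shapes are our assumptions for illustration.

```python
# Transcription of Equation 1: predict user u's grade for item i as a
# similarity-weighted average of u's grades on i's most similar items.

def predict_grade(user_grades, similarities, item, k=10):
    """user_grades: {item: grade} for one user.
    similarities: {(item, other_item): s} precomputed item-item similarities.
    Returns None when no weighted evidence is available."""
    neighbors = sorted(
        ((similarities.get((item, n), 0.0), g)
         for n, g in user_grades.items() if n != item),
        key=lambda sg: abs(sg[0]),
        reverse=True,
    )[:k]
    num = sum(s * g for s, g in neighbors)        # Σ s_{i,N} · R_{u,N}
    den = sum(abs(s) for s, _ in neighbors)       # Σ |s_{i,N}|
    return num / den if den else None
```

As the text notes, this yields a single point prediction per (user, item) pair; turning such predictions into full distributions is the subject of Section 4.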
4 Algorithm Details

The CF algorithm based on the function p_{u,i} described in Section 3.4 defines a deterministic version of the DBN CPTs that we want to generate. We use these predictions and the data from past students' transcripts to generate probability distributions over possible grades, producing full CPTs. Algorithm 1 describes the process of turning deterministic predictions from p_{u,i} into CPTs. In this algorithm we make the assumption that the deviation of predictions from past data yields a reasonable approximation of the true probability distribution.

Given the predictions from the CF function described in Section 3.4, we build a distribution table, grade_distribution, with rows and columns indexed by predicted and actual grades. If G1 and G2 are possible grades, then the grade_distribution[G1][G2] entry is the number of transcripts for which the CF algorithm predicted G1 and the student received G2 for the class in question. After we construct grade_distribution, we normalize each row of the table to form probability distributions. For a grade g, row grade_distribution[g] is now a probability distribution over actual grades when the recommender R predicts g.

Algorithm 1: Generate DBNs from CF predictions
    Input: Past_Users — a database of past user grades.
    Output: CPT — a set of CPTs, one per course.
    foreach user in Past_Users do
        foreach item in user's graded items do
            p = p_{user,item};
            actual = actual grade for item;
            grade_distribution[p][actual]++;
        end
    end
    normalize rows of grade_distribution;
    foreach item do
        T = create prediction table for item;
        foreach row in T do
            u* = temporary user with the grade assignments in row;
            p = p_{u*,item};
            add distribution grade_distribution[p] to the current row of T;
        end
        CPT(item) = T;
    end

The second half of the algorithm constructs a prediction table for each course. A prediction table T for a course c reflects the overall structure of the final CPT for c; each row of T contains one assignment of values to c's parent variables (the parents being defined by the structure-size parameter and our distance metric). Each row of T thus implies a hypothetical user transcript: an assignment of past grades. Using R we make a prediction p for each row, select the probability distribution grade_distribution[p], and fill that row of T with it.

After completion, CPT is a set of CPTs, one per course, where CPT(c) is the CPT for course c.
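Algorithm 1 can be sketched in executable form. This is our illustration under assumed data shapes (the `predict` callback stands in for p_{u,i}; the grade set is simplified to 1–5, whereas the paper's domain also includes a NOT TAKEN value), not the authors' code.

```python
# Sketch of Algorithm 1: (1) tally how predicted grades deviate from actual
# grades and normalize each row into a distribution; (2) stamp those
# distributions into one CPT row per parent-grade assignment.
from collections import defaultdict
from itertools import product

GRADES = [1, 2, 3, 4, 5]  # simplified; the paper also models NOT TAKEN

def build_grade_distribution(past, predict):
    """past: iterable of (user_grades, item, actual_grade).
    predict(user_grades, item) -> predicted grade."""
    table = {g: defaultdict(int) for g in GRADES}
    for user_grades, item, actual in past:
        table[predict(user_grades, item)][actual] += 1
    dist = {}
    for p, row in table.items():                 # normalize each row
        total = sum(row.values())
        dist[p] = {g: row[g] / total for g in GRADES} if total else None
    return dist

def build_cpt(item, parent_items, dist, predict):
    """One CPT row per assignment of grades to the parent courses."""
    cpt = {}
    for assignment in product(GRADES, repeat=len(parent_items)):
        hypothetical = dict(zip(parent_items, assignment))  # temporary user u*
        p = predict(hypothetical, item)
        cpt[assignment] = dist[p]
    return cpt
```

The key move is the second loop: every CPT row corresponds to a hypothetical transcript, and the row's distribution is borrowed wholesale from how the predictor erred on real transcripts with the same prediction.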
5 Results

In this section we describe the tests we ran on the academic advising data. We evaluate two variants of the item-based collaborative filtering algorithm on this dataset. We also generate two baseline DBN models and two collaborative-filtering-based models, and analyze their performance.

5.1 Data and Experimental Setup

Models are generated from the transcript data of approximately 4760 undergraduate students who enrolled during the 2000–2003 academic years. These anonymized data are a time-stamped (semester and year) series of transactions labeled with course and instructor information and grade outcomes. Because we have meta-data for computer science courses, we restricted our attention to students who took computer science courses during their academic careers.

Our analysis is broken down into two steps: collaborative filtering evaluation and DBN evaluation. We chose to evaluate the item-based collaborative filtering algorithm first, to measure the algorithm's performance on an academic dataset. Testing of both collaborative filtering and DBNs is performed using 10-fold cross validation (partitioned randomly).

We use two methods for evaluating the item-based collaborative filtering algorithm on this dataset: mean absolute error and the percent of misclassified predictions. Together, these statistics indicate how far predictions are from actual grades and how often predictions are misclassified, respectively. We selected these statistics because they are fairly straightforward to interpret, and because mean absolute error has been used in the past for collaborative filtering evaluation, allowing comparison to performance on other datasets.

As a baseline for comparison with our 2-slice DBNs, we generated DBNs using more standard techniques. Baseline DBN structures were found through exhaustive search of the network structure space, using the Bayesian information criterion (BIC) [8] as a scoring function. The highest-scoring network was selected for a specified neighborhood size, and parameters were estimated using both maximum likelihood (ML) and Bayesian parameter estimation. Baseline DBNs were generated using the bnlearn software package [9].

As a means of evaluating the performance of the DBNs, we calculated the log-likelihood loss of the models and the percent of misclassified predictions. Log-likelihood loss is the negation of the log-likelihood, which we wish to minimize. "Predictions" in this case are similar to the deterministic predictions made by a collaborative filtering algorithm: we select the most likely outcome as a deterministic prediction and count the number classified correctly or incorrectly. This also gives us a basis for comparing our DBNs to the item-based collaborative filtering algorithm.

[Figure 2: CF prediction mean absolute error, plotted against neighborhood size (0–25) for the Pearson correlation and cosine distance metrics.]

[Figure 3: Percent of misclassified predictions, plotted against neighborhood size (0–25) for the Pearson correlation and cosine distance metrics.]

5.2 Collaborative filtering evaluation

Figures 2 and 3 show that the two distance metrics, Pearson's correlation and cosine similarity, perform very similarly. These figures display the mean absolute error and percent of misclassified predictions for both metrics.

Figure 2 shows that mean absolute error decreases swiftly as the neighborhood size increases. After about 11 neighbors this decrease slows, and little change is observed as the number of predictors continues to increase. This curve is similar to tests conducted on the MovieLens dataset [7].
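The two evaluation statistics used here are simple to state precisely. The sketch below is our illustration (function names and the rounding-based misclassification rule are assumptions; the paper does not spell out how fractional predictions are discretized):

```python
# Evaluation sketch: mean absolute error and percent of misclassified
# predictions over (predicted, actual) grade pairs.

def mean_absolute_error(pairs):
    """pairs: list of (predicted, actual) grades."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def percent_misclassified(pairs):
    """Assumption: a prediction is misclassified when its rounded grade
    differs from the actual grade."""
    wrong = sum(1 for p, a in pairs if round(p) != a)
    return 100.0 * wrong / len(pairs)
```

The two statistics are complementary: MAE rewards predictions that are close but wrong, while the misclassification rate counts only exact (discretized) hits.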
Figure 3 shows how the percent of misclassified predictions changes as neighborhood size increases. At first there is an abrupt jump in this percent; afterward the curve resembles the curve for mean absolute error, with an apparent ideal neighborhood size of about 15 neighbors.

5.3 DBN evaluation

Figures 4 and 5 show that the DBNs learned using collaborative filtering (Pearson and cosine) outperform the baseline DBNs. Figure 4 shows the log-likelihood loss averaged over all models for a given neighborhood size; Figure 5 shows the percent of misclassified predictions for each model. Baseline DBNs are labeled "Bayes" and "ML" for their parameter estimation methods; collaborative-filtering-inspired DBNs are labeled "Pearson" and "Cosine" for the distance metric used in the collaborative filtering algorithm.

[Figure 4: Average log-likelihood loss, plotted against neighborhood size (2–10) for the Bayes, ML, Pearson, and Cosine models.]

[Figure 5: Observations misclassified by the DBN, plotted against neighborhood size (0–10) for the Bayes, ML, Pearson, and Cosine models.]

In terms of minimizing loss (Figure 4), the maximum-likelihood, Pearson, and cosine models show similar performance. At a neighborhood size of 2, these models have log-likelihood losses tightly clustered around 1.14–1.16, and loss decreases steadily as the neighborhood size increases. However, the Bayesian model appears unable to cope with increasing neighborhood size, showing an increasing loss. This is likely due to the sparsity of the data and the growth in the number of possible parent configurations that comes with an increased neighborhood size.

Classification accuracy (Figure 5) shows steady improvement as neighborhood size increases across all models, with the collaborative filtering models showing much better accuracy than the Bayes and maximum-likelihood models at all neighborhood sizes. At a neighborhood size of only one, the Pearson and cosine models show accuracy (48.74% and 49.45% misclassified, respectively) comparable to the ML and Bayesian models at neighborhood sizes of 5 (49.52% misclassified) and 7 (49.47% misclassified), respectively. At a neighborhood size of 10, the Pearson model shows the lowest misclassification rate, approximately 42.18%.

Comparing Figures 3 and 5, we find that at 6–7 parent variables our baseline DBNs outperformed the item-based collaborative filtering algorithm. This is consistent with other experiments demonstrating that Bayesian methods of classification give better results than the standard item-based algorithm [10].

However, in terms of the percent of misclassified observations, the CF-based DBNs outperformed both our benchmarks and the item-based collaborative filtering algorithm they were based on, at all neighborhood sizes. At 17 neighbors the CF algorithm reached a misclassification rate of approximately 49.6%, while at a neighborhood size of only 10 the CF-based DBNs had a misclassification rate of approximately 42.18%. This indicates that by observing how predicted grades deviate from actual grades on a per-item basis (as we did with CPT generation in Algorithm 1), one may be able to construct a better collaborative filtering algorithm.

Across all tests the Pearson model showed a slight advantage over the cosine model. This indicates that improvements in the item-based collaborative filtering used to generate DBNs may lead to improvements in the resulting DBN models.
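For reference, the log-likelihood loss used above can be written out explicitly. This is a generic sketch under an assumed model interface (a callable returning a distribution over grades), not the bnlearn computation used in the experiments; the epsilon floor for zero-probability outcomes is also our assumption.

```python
# Sketch of log-likelihood loss: the negated average log-probability the
# model assigns to the actually observed grades. Assumed model interface:
# model(parent_assignment) -> {grade: probability}.
from math import log

def log_likelihood_loss(model, test_cases, eps=1e-9):
    """test_cases: list of (parent_assignment, actual_grade)."""
    total = 0.0
    for assignment, actual in test_cases:
        p = model(assignment).get(actual, 0.0)
        total += log(max(p, eps))  # floor avoids log(0) on unseen outcomes
    return -total / len(test_cases)
```

Lower is better: a model that spreads probability mass away from the observed grades pays for it directly in this loss.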
6 Conclusions and future directions

Our goal is to develop DBN transition models for the purpose of decision-theoretic planning. In this paper we have presented a novel approach for generating DBN planning models from sparse data, and we used academic advising data to show the validity of our method. One of the benefits of this method is its flexibility regarding the choice of collaborative filtering recommendation algorithm. Our models were constructed using a generic item-based collaborative filtering algorithm; any similar item-based algorithm can be used in its place, giving us a wide variety of algorithms which can be employed using off-the-shelf software packages.

We are also investigating methods for modeling utility in this and similar domains, as well as decision-theoretic planning algorithms that can run on domains of the size and complexity presented here, and larger.

Acknowledgments

We thank the University's SSTARS lab consultants for assistance with statistical analysis. We also thank Nick Mattei, Robert Crawford, Daniel Michler, and other members of our lab for their contributions to this and related projects.

References

[1] John Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 43–52. Morgan Kaufmann, 1998.

[2] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1990.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[4] Nir Friedman, Iftach Nachman, and Dana Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI), pages 206–215, 1999.

[5] Anna Goldenberg and Andrew Moore. Tractable learning of large Bayes net structures from sparse data. In ICML '04: Proceedings of the 21st International Conference on Machine Learning, New York, NY, USA, 2004. ACM.

[6] Kaname Kojima, Eric Perrier, Seiya Imoto, and Satoru Miyano. Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, 11:285–310, 2010.

[7] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295, New York, NY, USA, 2001. ACM.

[8] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[9] Marco Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1–22, 2010.

[10] Xiaoyuan Su and Taghi M. Khoshgoftaar. Collaborative filtering for multi-class data using belief nets algorithms. In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '06), pages 497–504, 2006.

[11] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.