Constructing a Dynamic Bayes Net Model of Academic Advising

Joshua T. Guerin and Judy Goldsmith*
Department of Computer Science
University of Kentucky
Lexington, KY 40506-0046
jtguer2@uky.edu, goldsmit@cs.uky.edu

* This material is based upon work supported by the National Science Foundation under Grant No. 1049360.

Abstract

In this paper we apply ideas from collaborative filtering to the problem of building dynamic Bayesian network (DBN) models for planning. We demonstrate that item-based collaborative filtering can be used to construct dynamic Bayesian networks for use in large, factored domains with sparse data. Such Bayesian networks can model the transition function for decision-theoretic planning. We demonstrate the feasibility and effectiveness of this technique on an academic advising domain, based on student grades in computer science and related courses at the University of Kentucky.

1 Introduction

In this paper we examine the use of memory-based collaborative filtering (CF) algorithms for constructing static models of data. This work is grounded in the real-world domain of academic advising: we use an item-based collaborative filtering algorithm to generate dynamic Bayesian network models of an advising domain from sparse grade data.

Collaborative filtering algorithms are designed to aggregate the opinions or preferences of a large number of users in order to extrapolate information about unobserved preferences for new and existing users. Recommendation systems are constructed using CF techniques to locate items in a database which a target user is likely to prefer. Preferences are typically defined by grades that the user provides either explicitly (by grading items that they have already seen) or implicitly (often indicated by patterns of behavior, such as browsing habits). These grades can be represented in a number of ways, but are often numerical in nature; most recommender systems ask for a numerical grade (1–5) or a grade based on letters or "stars" which is easily mapped to a numerical grade (for instance 1–5 stars, or a letter grade of A–E).

CF algorithms can be roughly divided into model-based and memory-based algorithms. Model-based algorithms involve generating a predictive model from the data and using it to make preference-related predictions. One formalism that has seen success in model-based CF is the Bayesian network [1].

Memory-based CF algorithms operate over the database of items to make predictions, leveraging a measure of similarity between users or (more commonly) between items to determine grades for unseen items. This class of algorithms provides several features which are useful for making predictions. Notably, these algorithms are designed to operate over very large datasets (common examples include the Netflix dataset, the MovieLens datasets, and the Amazon.com recommendation system). Such datasets typically contain tens of thousands of items and grades from hundreds of thousands of users; however, since most users only provide grades for a small percentage of items, these datasets are very sparse. Because of this, modern recommendation systems must scale well and must work well with very sparse data.

2 A Predictive Model for Academic Advising

Reasoning in the domain of undergraduate academic advising is often approached as a deterministic process: short- and long-term decision making is based on the assumption that a student's actions (i.e., taking one or more courses) will succeed. This doesn't capture the nuances and complexity of the real world. The outcome of taking a course cannot always be predicted with certainty; even a student who makes consistent A's may perform poorly at some point.

Given the stochastic nature of grade prediction, it may be desirable to construct statistics-based models of student performance from real-world data.
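To make the sparse grade data concrete, here is a minimal sketch of one possible transcript representation. This is our own illustration, not the paper's code: the dict-of-dicts layout, the course codes, the student IDs, and the `grades_for` helper are all hypothetical, with letter grades mapped to integers (A=1 through E=5) as described above.

```python
# Hypothetical sketch: sparse transcript data as a dict of {student: {course: grade}}.
# Course codes and students are invented for illustration.

GRADE_SCALE = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

# Most courses are absent from each transcript, which is what makes the data sparse.
transcripts = {
    "s001": {"CS115": GRADE_SCALE["A"], "CS215": GRADE_SCALE["B"]},
    "s002": {"CS115": GRADE_SCALE["C"]},
    "s003": {"CS215": GRADE_SCALE["A"], "MA113": GRADE_SCALE["B"]},
}

def grades_for(course):
    """Collect the grades all students earned in one course."""
    return {s: t[course] for s, t in transcripts.items() if course in t}
```

A layout like this makes the sparsity visible: each student's record carries only the handful of courses actually taken, out of hundreds offered.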
Students leave behind tangible evidence of progress in the form of transcript data. Universities amass a wealth of data with which to make predictions about grades, and from this we can construct probabilistic predictive models.

The problem of grade prediction closely resembles the rating prediction problem in collaborative filtering: make predictions about a student's grades in untaken courses, given their past grades and the transcript data from many past students. Letter grades can map directly to integers, where A=1 and Failure=5.

The Dynamic Bayesian Network (DBN) formalism has a number of features which make it ideal for this sort of modeling. A DBN model consists of a directed acyclic graph with links representing temporal, probabilistic relationships between variables, and conditional probability tables (CPTs) that specify those relationships quantitatively [2] (a discussion of DBNs follows in Section 3.2).

We are interested in a class of DBNs which model only a single time-step, known as 2-slice DBNs. This imposes restrictions on the underlying graphical structure: specifically, variable values at one time-step are conditioned only on the values of parent variables at the previous time-step. These structural restrictions make 2-slice DBNs a potentially compact representation for decision-theoretic planning, and for this reason we limit our attention to them.

In this paper we present a simple collaborative filtering algorithm, and demonstrate how it is used to generate a valid DBN model of state transitions in the advising domain. We use real-world data from the Computer Science Department at our university as a testbed for our model generation techniques.
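The 2-slice restriction described above can be sketched as a data structure. This is an assumed representation for illustration, not the authors' implementation; the class name and method signatures are ours, and the example uses the course variables from the paper's advising example (LTC conditioned on INTR, DISC, and INUM).

```python
# Minimal sketch of a 2-slice DBN structure: each time-(t+1) variable lists its
# time-t parents, and its CPT maps an assignment of parent grades to a
# distribution over its own possible grades. Assumed representation.

class TwoSliceDBN:
    def __init__(self):
        self.parents = {}   # child course -> list of parent courses (time t)
        self.cpt = {}       # child course -> {parent-grade tuple: distribution}

    def add_course(self, child, parent_courses):
        self.parents[child] = list(parent_courses)
        self.cpt[child] = {}

    def set_distribution(self, child, parent_grades, dist):
        # parent_grades is a tuple aligned with self.parents[child]
        self.cpt[child][tuple(parent_grades)] = dist

    def predict(self, child, parent_grades):
        return self.cpt[child][tuple(parent_grades)]

# Example mirroring the paper's advising domain: the grade in LTC at time t+1
# is conditioned on the time-t grades in INTR, DISC, and INUM.
dbn = TwoSliceDBN()
dbn.add_course("LTC", ["INTR", "DISC", "INUM"])
dbn.set_distribution("LTC", (1, 2, 1), {1: 0.6, 2: 0.3, 3: 0.1})
```

Note the Markov flavor of this structure: a prediction for a child at time t+1 looks only at the parent assignment from time t, never further back.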
3 Background

3.1 Bayesian Networks

A Bayesian network is a directed acyclic graph G = ⟨V, E⟩, where each vertex v ∈ V is a variable with domain dom(v). Each v ∈ V has an associated probability distribution over values in dom(v), conditioned on the values of Pa_v ⊂ V, the parents of v. These conditional probability distributions are usually enumerated in tabular form as conditional probability tables (CPTs), one for each variable.

Learning of Bayesian networks is often divided into structure learning and parameter learning. Structure learning is the problem of learning the graphical structure E by discovering predictive or causal dependencies between variables. Parameter learning is the problem of learning the conditional probability distributions for a given network structure.

Because the space of all possible networks is very large, structure learning is usually approached as a heuristic search problem, or as an exact search of a constrained version of the search space (see [4, 6, 11] for examples). Search for an optimal (or near-optimal) network structure is guided by a scoring function (one example is the log-likelihood scoring function).

Once structure is known, CPT parameters (probability distributions over outcomes) are generally learned from the data. Examples of parameter learning for DBNs include maximum likelihood estimation (for instance, via the expectation maximization [3] algorithm) and Bayesian estimation.

Unlike most Bayesian network learning algorithms, ours is validated on the quality of its predictions rather than of its inference. In other words, our work looks forward in time rather than backward. We conjecture that good learned probabilistic planning models may actually differ from probabilistic inference models learned from the same data.

Rather than selecting a single ideal structural size, we make structure size a parameter of our algorithm. Since we are considering models for the purpose of planning, we must consider the tradeoff between accuracy of the representation and tractability of planning. Our goal is to be able to generate DBNs of different sizes for different purposes. We examine how our algorithm fares as a function of structure size in Section 5. At this point we are left with the question of how to select n parent nodes for each node.

Goldenberg et al. approached a similar problem of learning Bayesian network structures from sparse data using frequent set mining [5]. Frequent sets are widely used in data mining for learning common co-occurrence between sets of items. The idea of applying frequent set mining to academic advising may be useful in other capacities (learning combinations of courses which should or should not be taken together); however, co-occurrence of actions is less applicable to building predictive models of advising, because courses which are frequently taken together are unlikely to make good predictors for each other. Parent courses should be taken before child courses, otherwise they provide little information.

Rather than using co-occurrence, we make the assumption that similar variables make better predictors than dissimilar variables, and we examine the use of pairwise item similarity in selecting parent nodes. Item similarity is commonly used in collaborative filtering and other data mining applications to determine which items hold the most predictive power for a target item, allowing for better predictions to be made.

3.2 2-Slice Dynamic Bayesian Networks

Bayesian networks have been demonstrated to be useful for inference in a number of domains; however, the standard framework does not have an explicit notion of time. A dynamic Bayesian network builds upon the Bayesian network idea, incorporating temporal or sequential aspects of the data into its structure. Variables at one time-step may influence the value of variables at future time-steps (or at the same time-step).

We are interested in a special case of dynamic Bayesian networks, the 2-slice dynamic Bayesian network. A 2-slice dynamic Bayesian network is a Bayesian network with vertex set V ∪ V′, where V represents the variables at time t and V′ the variables at time t+1, with edges from V to V′ (and sometimes between vertices in V′). In DBNs of this form, V and V′ may be visualized as two separate columns representing, respectively, the variables at time t and t+1.

This structural formulation implies two theoretical assumptions under which we operate: a stationary assumption, where models are not time-dependent, and a Markov assumption, where there is no memory of past states; future values are conditioned only on the current system state.

Figure 1 gives the structure of an example 2-slice DBN which could be used for planning in an academic domain. This DBN structure shows that the expected grade in Logic and Theory of Computing (LTC) is conditioned on the grades obtained in Introduction to Programming (INTR), Discrete Mathematics (DISC), and Introduction to Numerical Methods (INUM).

[Figure 1: An example DBN structure, showing time-t nodes INTR, DISC, INUM, LTC and their time-(t+1) counterparts INTR′, DISC′, INUM′, LTC′.]

In the case of discrete-valued variables, each child node in the DBN has an associated conditional probability table (CPT) which gives a probability distribution over possible values for every possible assignment to its parent variables (incoming edges in the graph) at the previous time-step. Because all possible assignments to parent variables may need to be enumerated explicitly, CPT size is exponential in the number of parent variables. For example, a CPT for a single course with 5 parents, each of which has 6 possible values (A–D, Failure, and NOT TAKEN), will have 6^5 = 7,776 rows, each containing a probability distribution over the 6 possible outcomes.

For modern computers, tables of this size are unlikely to cause representational issues. However, the need for enough data to populate such a table's 6^6 = 46,656 probabilities makes seemingly abundant data rather sparse. Popular or required courses may be taken by hundreds or even thousands of students within the span of several years, but even this is insufficient to derive realistic probability distributions from straight statistical analysis. The problem is worse for most other courses (and for smaller colleges and universities), where enrollment over several years may reach only hundreds of students or fewer.

In order to deal with the problem of prediction when data is sparse, we turn to techniques from collaborative filtering to aggregate the data that is available. Collaborative filtering algorithms are commonly used to narrow down choices based on a user's preferences and the preferences of current and past users. A common example application is predicting preferences over unseen items (movies, music, groceries) based on grades given for other items [1].

3.3 Collaborative Filtering

Collaborative filtering recommendation algorithms typically fall into one of two general categories: model-based algorithms and memory-based algorithms [1]. Model-based algorithms involve generating a model based on data, and using the model to make predictions. We are interested in memory-based algorithms, which use the entire data set to make predictions. This class of algorithms is described in Section 3.4.

Collaborative filtering algorithms also rely heavily on the notion of similarity: similar users are likely to assign similar grades to items, and likewise, similar items may be given similar grades. Collaborative filtering systems often employ one of these two assumptions, known respectively as user-based and item-based collaborative filtering. In this paper we focus on item-based collaborative filtering because of the performance demonstrated by these algorithms and because of their user-independent nature.

One of the most common approaches in collaborative filtering is to use the database of user grades to determine item-item similarity. For each pair of items in the database, a vector of grades is created, retaining only grades from users who voted for both items [7]. To these vectors a number of distance metrics can be applied; in our implementation we tested two common vector similarity metrics, Pearson's correlation coefficient and cosine similarity.
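The item-item similarity computation described above can be sketched as follows. This is an illustrative implementation under assumed data shapes (the dict-of-dicts transcript layout and function names are ours): for a pair of items we keep only co-rated grades, then apply cosine similarity or Pearson's correlation to the resulting vectors.

```python
# Sketch of item-item similarity from co-rated grade vectors. Assumed layout:
# data is {student: {course: grade}}; only students who graded both courses
# contribute to the pair's vectors.
from math import sqrt

def corated(data, a, b):
    """Return the aligned grade vectors for items a and b over co-raters."""
    pairs = [(t[a], t[b]) for t in data.values() if a in t and b in t]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def cosine_sim(x, y):
    num = sum(xi * yi for xi, yi in zip(x, y))
    den = sqrt(sum(xi * xi for xi in x)) * sqrt(sum(yi * yi for yi in y))
    return num / den if den else 0.0

def pearson_sim(x, y):
    n = len(x)
    if n < 2:
        return 0.0
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy) if sx and sy else 0.0
```

Restricting each pair to co-raters is what makes these metrics usable on sparse data: the vectors compared are short but fully populated.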
3.4 Item-Based Collaborative Filtering

The collaborative filtering algorithm that we use in this paper is an item-based algorithm presented by Sarwar et al. [7]. First, item-item similarity is calculated over all items in the database; for item-item similarity we use Pearson's correlation coefficient and cosine similarity. For a user u and an item i, predictions are made using the weighted sum of u's grades for all items which are similar to i. This can be expressed as:

    p_{u,i} = ( Σ_N s_{i,N} · R_{u,N} ) / ( Σ_N |s_{i,N}| ),    (1)

where the sums range over all items N similar to i. Here, p_{u,i} is the predicted grade that user u might give item i, s_{i,N} is the similarity between items i and N, and R_{u,N} is the grade that u provided for item N.

Equation 1 produces a single, most likely grade for the given user and item. Because a DBN requires a probability distribution over all possible grades, we are not yet ready to encode our DBNs.
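Equation 1 can be transcribed directly. The sketch below assumes precomputed similarities keyed by item pair and a per-user grade dict; the helper name, the `k` nearest-neighbor cutoff, and the data shapes are our assumptions for illustration.

```python
# Transcription of Equation 1: predict user u's grade for item i as a
# similarity-weighted average of u's grades on i's most similar items.

def predict_grade(user_grades, similarities, item, k=10):
    """user_grades: {item: grade} for one user.
    similarities: {(item, other_item): s} precomputed item-item similarities.
    Returns None when no weighted evidence is available."""
    neighbors = sorted(
        ((similarities.get((item, n), 0.0), g)
         for n, g in user_grades.items() if n != item),
        key=lambda sg: abs(sg[0]),
        reverse=True,
    )[:k]
    num = sum(s * g for s, g in neighbors)        # Σ s_{i,N} · R_{u,N}
    den = sum(abs(s) for s, _ in neighbors)       # Σ |s_{i,N}|
    return num / den if den else None
```

As the text notes, this yields a single point prediction per (user, item) pair; turning such predictions into full distributions is the subject of Section 4.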
4 Algorithm Details

The CF algorithm based on the function p_{u,i} described in Section 3.4 defines a deterministic version of the DBN CPTs that we want to generate. We use these predictions and the data from past students' transcripts to generate probability distributions over possible grades, producing full CPTs. Algorithm 1 describes the process of turning deterministic predictions from p_{u,i} into CPTs. In this algorithm we make the assumption that the deviation of predictions from past data yields a reasonable approximation of the true probability distribution.

Given the predictions from the CF function described in Section 3.4, we build a distribution table, grade_distribution, with rows and columns indexed by predicted and actual grades. If G1 and G2 are possible grades, then the grade_distribution[G1][G2] entry is the number of transcripts for which the CF algorithm predicted G1 and the student received G2 for the class in question. After we construct grade_distribution, we normalize each row of the table to form probability distributions. For a grade g, row grade_distribution[g] is now a probability distribution over actual grades when the recommender R predicts g.

Algorithm 1: Generate DBNs from CF predictions
    Input: Past_Users — a database of past user grades.
    Output: CPT — a set of CPTs, one per course.
    foreach user in Past_Users do
        foreach item in user's graded items do
            p = p_{user,item};
            actual = actual grade for item;
            grade_distribution[p][actual]++;
        end
    end
    normalize rows of grade_distribution;
    foreach item do
        T = create prediction table for item;
        foreach row in T do
            u* = temporary user with the grade assignments in row;
            p = p_{u*,item};
            add distribution grade_distribution[p] to the current row of T;
        end
        CPT(item) = T;
    end

The second half of the algorithm constructs a prediction table for each course. A prediction table T for a course c reflects the overall structure of the final CPT for c; each row of T contains one assignment of values to c's parent variables (the parents being defined by the structure-size parameter and our distance metric). Each row of T thus implies a hypothetical user transcript: an assignment of past grades. Using R we make a prediction p for each row, select the probability distribution grade_distribution[p], and fill that row of T with it.

After completion, CPT is a set of CPTs, one per course, where CPT(c) is the CPT for course c.
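Algorithm 1 can be sketched in executable form. This is our illustration under assumed data shapes (the `predict` callback stands in for p_{u,i}; the grade set is simplified to 1–5, whereas the paper's domain also includes a NOT TAKEN value), not the authors' code.

```python
# Sketch of Algorithm 1: (1) tally how predicted grades deviate from actual
# grades and normalize each row into a distribution; (2) stamp those
# distributions into one CPT row per parent-grade assignment.
from collections import defaultdict
from itertools import product

GRADES = [1, 2, 3, 4, 5]  # simplified; the paper also models NOT TAKEN

def build_grade_distribution(past, predict):
    """past: iterable of (user_grades, item, actual_grade).
    predict(user_grades, item) -> predicted grade."""
    table = {g: defaultdict(int) for g in GRADES}
    for user_grades, item, actual in past:
        table[predict(user_grades, item)][actual] += 1
    dist = {}
    for p, row in table.items():                 # normalize each row
        total = sum(row.values())
        dist[p] = {g: row[g] / total for g in GRADES} if total else None
    return dist

def build_cpt(item, parent_items, dist, predict):
    """One CPT row per assignment of grades to the parent courses."""
    cpt = {}
    for assignment in product(GRADES, repeat=len(parent_items)):
        hypothetical = dict(zip(parent_items, assignment))  # temporary user u*
        p = predict(hypothetical, item)
        cpt[assignment] = dist[p]
    return cpt
```

The key move is the second loop: every CPT row corresponds to a hypothetical transcript, and the row's distribution is borrowed wholesale from how the predictor erred on real transcripts with the same prediction.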
5 Results

In this section we describe the tests we ran on the academic advising data. We evaluate two variants of the item-based collaborative filtering algorithm on this dataset. We also generate two baseline DBN models and two collaborative-filtering-based models, and analyze their performance.

5.1 Data and Experimental Setup

Models are generated from the transcript data of approximately 4760 undergraduate students who enrolled during the 2000–2003 academic years. These anonymized data are a time-stamped (semester and year) series of transactions labeled with course and instructor information and grade outcomes. Because we have meta-data for computer science courses, we restricted our attention to students who took computer science courses during their academic careers.

Our analysis is broken down into two steps: collaborative filtering evaluation and DBN evaluation. We chose to evaluate the item-based collaborative filtering algorithm first, to measure the algorithm's performance on an academic dataset. Testing of both collaborative filtering and DBNs is performed using 10-fold cross validation (partitioned randomly).

We use two methods for evaluating the item-based collaborative filtering algorithm on this dataset: mean absolute error and the percent of misclassified predictions. Together, these statistics indicate how far predictions are from actual grades and how often predictions are misclassified, respectively. We selected these statistics because they are fairly straightforward to interpret, and because mean absolute error has been used in the past for collaborative filtering evaluation, allowing comparison to performance on other datasets.

As a baseline for comparison with our 2-slice DBNs, we generated DBNs using more standard techniques. Baseline DBN structures were found through exhaustive search of the network structure space, using the Bayesian information criterion (BIC) [8] as a scoring function. The highest-scoring network was selected for a specified neighborhood size, and parameters were estimated using both maximum likelihood (ML) and Bayesian parameter estimation. Baseline DBNs were generated using the bnlearn software package [9].

As a means of evaluating the performance of the DBNs, we calculated the log-likelihood loss of the models and the percent of misclassified predictions. Log-likelihood loss is the negation of the log-likelihood, which we wish to minimize. "Predictions" in this case are similar to the deterministic predictions made by a collaborative filtering algorithm: we select the most likely outcome as a deterministic prediction and count the number classified correctly or incorrectly. This also gives us a basis for comparing our DBNs to the item-based collaborative filtering algorithm.

[Figure 2: CF prediction mean absolute error, plotted against neighborhood size (0–25) for the Pearson correlation and cosine distance metrics.]

[Figure 3: Percent of misclassified predictions, plotted against neighborhood size (0–25) for the Pearson correlation and cosine distance metrics.]

5.2 Collaborative filtering evaluation

Figures 2 and 3 show that the two distance metrics, Pearson's correlation and cosine similarity, perform very similarly. These figures display the mean absolute error and percent of misclassified predictions for both metrics.

Figure 2 shows that mean absolute error decreases swiftly as the neighborhood size increases. After about 11 neighbors this decrease slows, and little change is observed as the number of predictors continues to increase. This curve is similar to tests conducted on the MovieLens dataset [7].
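The two evaluation statistics used here are simple to state precisely. The sketch below is our illustration (function names and the rounding-based misclassification rule are assumptions; the paper does not spell out how fractional predictions are discretized):

```python
# Evaluation sketch: mean absolute error and percent of misclassified
# predictions over (predicted, actual) grade pairs.

def mean_absolute_error(pairs):
    """pairs: list of (predicted, actual) grades."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def percent_misclassified(pairs):
    """Assumption: a prediction is misclassified when its rounded grade
    differs from the actual grade."""
    wrong = sum(1 for p, a in pairs if round(p) != a)
    return 100.0 * wrong / len(pairs)
```

The two statistics are complementary: MAE rewards predictions that are close but wrong, while the misclassification rate counts only exact (discretized) hits.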
Figure 3 shows how the percent of misclassified predictions changes as neighborhood size increases. At first there is an abrupt jump in this percent; afterward the curve resembles the curve for mean absolute error, with an apparent ideal neighborhood size of about 15 neighbors.

5.3 DBN evaluation

Figures 4 and 5 show that the DBNs learned using collaborative filtering (Pearson and cosine) outperform the baseline DBNs. Figure 4 shows the log-likelihood loss averaged over all models for a given neighborhood size; Figure 5 shows the percent of misclassified predictions for each model. Baseline DBNs are labeled "Bayes" and "ML" for their parameter estimation methods; collaborative-filtering-inspired DBNs are labeled "Pearson" and "Cosine" for the distance metric used in the collaborative filtering algorithm.

[Figure 4: Average log-likelihood loss, plotted against neighborhood size (2–10) for the Bayes, ML, Pearson, and Cosine models.]

[Figure 5: Observations misclassified by the DBN, plotted against neighborhood size (0–10) for the Bayes, ML, Pearson, and Cosine models.]

In terms of minimizing loss (Figure 4), the maximum-likelihood, Pearson, and cosine models show similar performance. At a neighborhood size of 2, these models have log-likelihood losses tightly clustered around 1.14–1.16, and loss decreases steadily as the neighborhood size increases. However, the Bayesian model appears unable to cope with increasing neighborhood size, showing an increasing loss. This is likely due to the sparsity of the data and the growth in the number of possible parent configurations that comes with an increased neighborhood size.

Classification accuracy (Figure 5) shows steady improvement as neighborhood size increases across all models, with the collaborative filtering models showing much better accuracy than the Bayes and maximum-likelihood models at all neighborhood sizes. At a neighborhood size of only one, the Pearson and cosine models show accuracy (48.74% and 49.45% misclassified, respectively) comparable to the ML and Bayesian models at neighborhood sizes of 5 (49.52% misclassified) and 7 (49.47% misclassified), respectively. At a neighborhood size of 10, the Pearson model shows the lowest misclassification rate, approximately 42.18%.

Comparing Figures 3 and 5, we find that at 6–7 parent variables our baseline DBNs outperformed the item-based collaborative filtering algorithm. This is consistent with other experiments demonstrating that Bayesian methods of classification give better results than the standard item-based algorithm [10].

However, in terms of the percent of misclassified observations, the CF-based DBNs outperformed both our benchmarks and the item-based collaborative filtering algorithm they were based on, at all neighborhood sizes. At 17 neighbors the CF algorithm reached a misclassification rate of approximately 49.6%, while at a neighborhood size of only 10 the CF-based DBNs had a misclassification rate of approximately 42.18%. This indicates that by observing how predicted grades deviate from actual grades on a per-item basis (as we did with CPT generation in Algorithm 1), one may be able to construct a better collaborative filtering algorithm.

Across all tests the Pearson model showed a slight advantage over the cosine model. This indicates that improvements in the item-based collaborative filtering used to generate DBNs may lead to improvements in the resulting DBN models.
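For reference, the log-likelihood loss used above can be written out explicitly. This is a generic sketch under an assumed model interface (a callable returning a distribution over grades), not the bnlearn computation used in the experiments; the epsilon floor for zero-probability outcomes is also our assumption.

```python
# Sketch of log-likelihood loss: the negated average log-probability the
# model assigns to the actually observed grades. Assumed model interface:
# model(parent_assignment) -> {grade: probability}.
from math import log

def log_likelihood_loss(model, test_cases, eps=1e-9):
    """test_cases: list of (parent_assignment, actual_grade)."""
    total = 0.0
    for assignment, actual in test_cases:
        p = model(assignment).get(actual, 0.0)
        total += log(max(p, eps))  # floor avoids log(0) on unseen outcomes
    return -total / len(test_cases)
```

Lower is better: a model that spreads probability mass away from the observed grades pays for it directly in this loss.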
6 Conclusions and future directions

Our goal is to develop DBN transition models for the purpose of decision-theoretic planning. In this paper we have presented a novel approach for generating DBN planning models from sparse data, and we used academic advising data to show the validity of our method. One of the benefits of this method is its flexibility regarding the choice of collaborative filtering recommendation algorithm. Our models were constructed using a generic item-based collaborative filtering algorithm; any similar item-based algorithm can be used in its place, giving us a wide variety of algorithms which can be employed using off-the-shelf software packages.

We are also investigating methods for modeling utility in this and similar domains, as well as decision-theoretic planning algorithms that can run on domains of the size and complexity presented here, and larger.

Acknowledgments

We thank the University's SSTARS lab consultants for assistance with statistical analysis. We also thank Nick Mattei, Robert Crawford, Daniel Michler, and other members of our lab for their contributions to this and related projects.

References

[1] John Breese, David Heckerman, and Carl Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 43–52. Morgan Kaufmann, 1998.

[2] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1990.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[4] Nir Friedman, Iftach Nachman, and Dana Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI), pages 206–215, 1999.

[5] Anna Goldenberg and Andrew Moore. Tractable learning of large Bayes net structures from sparse data. In ICML '04: Proceedings of the 21st International Conference on Machine Learning, New York, NY, USA, 2004. ACM.

[6] Kaname Kojima, Eric Perrier, Seiya Imoto, and Satoru Miyano. Optimal search on clustered structural constraint for learning Bayesian network structure. Journal of Machine Learning Research, 11:285–310, 2010.

[7] Badrul Sarwar, George Karypis, Joseph Konstan, and John Riedl. Item-based collaborative filtering recommendation algorithms. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 285–295, New York, NY, USA, 2001. ACM.

[8] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.

[9] Marco Scutari. Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):1–22, 2010.

[10] Xiaoyuan Su and Taghi M. Khoshgoftaar. Collaborative filtering for multi-class data using belief nets algorithms. In Proceedings of the 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '06), pages 497–504, 2006.

[11] Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78, 2006.