    Conjoint Modeling of Temporal Dependencies in Event Streams



            Ankur P. Parikh∗†               Asela Gunawardana∗                    Christopher Meek
        School of Computer Science            Microsoft Research                   Microsoft Research
        Carnegie Mellon University            One Microsoft Way                    One Microsoft Way
        Pittsburgh, PA 15213, USA          Redmond, WA 98052, USA               Redmond, WA 98052, USA
          apparikh@cs.cmu.edu               aselag@microsoft.com                  meek@microsoft.com


∗ These authors contributed equally to the work.
† This research was conducted while APP was a research intern at Microsoft Research, Redmond.


                      Abstract

Many real world applications depend on modeling the temporal dynamics of streams of diverse events, many of which are rare. We introduce a novel model class, Conjoint Piecewise-Constant Conditional Intensity Models, and a learning algorithm that together yield a data-driven approach to parameter sharing with the aim of better modeling such event streams. We empirically demonstrate that our approach yields more accurate models of two real world data sets: search query logs and data center system logs.


1   Introduction

Event streams (temporal sequences of discrete events) are ubiquitous in many domains such as the firing patterns of neurons [2], gene expression data [7], system error logs [13], and web search engine query logs. Learning a model for the temporal dependencies between events can be useful for understanding and exploiting the dynamics in such domains. For example, learning that particular system events on a machine predict failures at a later time may allow a system administrator to prioritize preventive maintenance. Understanding how web search queries for commercially valuable terms depend on prior queries, possibly for other topics, can help in targeted advertising.

In many cases the data exhibit complex temporal dependencies. For example, what a user will query for at a particular time can depend on queries issued in the last few minutes, the previous day, as well as in the extended past. In a data center, the likelihood of a machine failing may depend on cascades of various prior warnings, errors, and failures.

In many domains, it is valuable to model fine distinctions between event types. For example, in targeted advertising, it is valuable to distinguish whether a user will issue queries related to mental health or to back pain rather than simply predicting that a user will issue a healthcare query. In a data center, it is more useful to learn that particular disk errors predict failed reboots than to know that generic error messages predict generic failures. While useful to model, such fine grained event types are rarer than the coarser grained ones that include them, and are therefore more difficult to model. Many models of temporal dependencies in event streams, such as the Piecewise-Constant Conditional Intensity Model (PCIM) [8], learn the dependencies of each event type separately, using independent sub-models for each event type. Thus, they are not able to model rare events well.

In this paper, we address this problem using parameter sharing, by introducing Conjoint Piecewise-Constant Conditional Intensity Models (C-PCIMs) and a learning algorithm for C-PCIMs, which together yield a novel data-driven approach to modeling fine grained event streams. C-PCIMs generalize PCIMs by allowing parameters to be shared across event types, and our learning algorithm uses the data to determine which parameters should be shared. In particular, we give a conjugate prior that allows parameter learning for the C-PCIM to be performed as efficiently as for the PCIM, and that leads to a closed-form marginal likelihood, allowing efficient structure learning. During structure learning, the C-PCIM learns which event types in which historical contexts can be modeled by shared parameters, thereby allowing more efficient use of data during parameter estimation. In cases where events are structured, with the different event types having known attributes, we show how structure learning can take advantage of these attributes to distinguish between different event types when their dependencies differ, while sharing parameters when they do not. Finally, we give empirical evidence that demonstrates the value of C-PCIMs in two real applications which are not well addressed by existing approaches: modeling the temporal query dynamics of web search users and modeling the temporal dynamics of system events in a data center. In the second application, we demonstrate that the expressive power of C-PCIMs combined with the data driven learning approach allows us to relax the strong assumption of identical machines, yielding further accuracy improvements.
2   Related Work

Event streams can be modeled in either discrete or continuous time. Discrete time approaches such as Hidden Markov Models (HMMs) [1, 14] and Dynamic Bayesian Networks (DBNs) [5] require event times to be discretized, which requires a choice of sampling rate, and with it, trade-offs involving fidelity of representation, time-span of dependencies, and computational cost. We avoid this choice, and model event streams in continuous time. There have been a number of recent approaches for modeling continuous-time processes. C-PCIMs, like PCIMs [8], model event streams as marked point processes, where events have both an arrival time and a label specifying the type of event, via conditional intensity functions. A number of other closely related approaches [15, 23, 24] use regression techniques such as generalized linear models, Cox regression and Aalen regression to model conditional intensity functions. Continuous Time Noisy-Or [21] and Poisson cascades [22] are also approaches for modeling event streams. These approaches do not address model selection, and require a parametric form for temporal dependencies to be specified. Predictive performance is strongly impacted by this modeling choice, which is domain dependent [21, 22]. There has also been some recent work on nonparametric Bayesian approaches for modeling unlabeled event streams [17].

Continuous Time Bayesian Networks (CTBNs) [12] and Markov Jump Processes [16] are Markov process models of the trajectories of discrete variables over continuous time, in contrast to PCIMs and C-PCIMs, which are not Markovian. A CTBN can be used to model an event stream by modeling each kind of event as a transition of a “toggle” variable [20], and using latent state variables to model their dynamics, to give a continuous time analog of an HMM.

Conjoint PCIMs differ from PCIMs in that parameters are shared across labels. Such approaches have been used in other problems such as building hidden Markov models with large state spaces [9, 10] and building n-gram language models [11]. Hierarchical Gamma-Exponential processes [18] are a hierarchical nonparametric Bayesian approach to conjoint modeling in Markov processes such as CTBNs. Regularization approaches may also be used for conjoint modeling [25], although we are unaware of applications to continuous time event modeling.


3   The Model

We represent an event sequence as y = {(t_i, l_i)}_{i=1}^n with 0 < t_1 < · · · < t_n, where t_i ∈ [0, ∞) is the time of the i-th event and l_i is its label, drawn from a finite label set L. The history at time t of event sequence y is the sub-sequence h(t, y) = {(t_i, l_i) | (t_i, l_i) ∈ y, t_i ≤ t}. We write h_i for h(t_{i−1}, y) when it is clear from context which y is meant. By convention t_0 = 0. We define the ending time t(y) of an event sequence y as the time of the last event in y: t(y) = max({t : (t, l) ∈ y}), so that t(h_i) = t_{i−1}.

The data x, which is a particular event sequence, is modeled as a realization of a regular marked point process [4, 6] with likelihood

    p(x|θ) = ∏_{l∈L} ∏_{i=1}^n λ_l(t_i|h_i; θ)^{1_l(l_i)} e^{−Λ_l(t_i|h_i; θ)}    (1)

where λ_l(t|h; θ) is the conditional intensity function [4] for label l, and Λ_l(t|h; θ) = ∫_{t(h)}^{t} λ_l(τ|h; θ) dτ. We write 1_Z(·) for the indicator function of a set Z and 1_z(·) for the indicator of the singleton {z}. Intuitively, λ_l(t|h; θ) is the expected rate of events with label l at time t given the history h. Note that despite the similarity to the likelihood of a non-homogeneous Poisson process, this likelihood does not in general define a Poisson process, as the conditioning on history can cause the independent increments property of Poisson processes to fail. The conditioning on the entire history also means that such processes are non-Markovian. Piecewise-Constant Conditional Intensity Models (PCIMs) [8] are a particular class of marked point process where the conditional intensity functions are restricted to be piecewise constant. In this paper, we introduce Conjoint PCIMs, which are PCIMs that use a conjoint representation for the conditional intensity functions. These models are described below.
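Before turning to PCIMs, it may help to fix this notation in code. The following is a minimal Python sketch (ours, not part of the paper) of the event-sequence representation: an event sequence is a time-ordered list of (time, label) pairs, and the history h(t, y) and ending time t(y) are simple operations on it. The names Event, history, and ending_time are illustrative.

```python
from typing import NamedTuple, List

class Event(NamedTuple):
    time: float   # t_i, event time in [0, infinity)
    label: str    # l_i, drawn from the finite label set L

def history(t: float, y: List[Event]) -> List[Event]:
    """h(t, y): the sub-sequence of events in y occurring at or before time t."""
    return [e for e in y if e.time <= t]

def ending_time(y: List[Event]) -> float:
    """t(y): the time of the last event in y (0 for an empty sequence, since t_0 = 0)."""
    return max((e.time for e in y), default=0.0)

# Example: a short event stream with labels drawn from L = {"A", "B"}.
y = [Event(0.5, "A"), Event(1.3, "B"), Event(2.0, "A")]
assert ending_time(history(1.5, y)) == 1.3
```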
3.1   PCIMs

In this section, we review PCIMs [8]. PCIMs are a class of marked point process where the conditional intensity function for each label is a piecewise constant function of time, taking one of a finite number of values. That is, λ_l(t|y) is piecewise constant in t for all t > t(y), and takes on values {λ_{ls}} for s ∈ Σ_l, where Σ_l is a finite label-dependent state set. The value λ_{ls} taken on by λ_l(t|y) at t for each y is specified by a piecewise constant state function σ_l(t, y), so that λ_l(t|y) = λ_{lσ_l(t,y)}.
Note that the state s summarizes all the information about t and y necessary for computing λ_l(t|y), in the sense that given s, λ_l(t|y) can be computed without further information about t and y. However, unlike in Markov models such as CTBNs [12], the state does not contain all the information about t and y needed for predicting future states.

As described above, the conditional intensity function λ_l(t|y) can be specified by a structure S_l = (Σ_l, σ_l(·, ·)) consisting of the state set Σ_l and the state function σ_l(·, ·), together with a parameter vector θ_l composed of a non-negative intensity λ_{ls} for each s ∈ Σ_l. In turn, a PCIM is specified by a structure S and a parameter Θ that consist of the per-label structures S_l and per-label parameter vectors θ_l.

Gunawardana et al. show [8] that given the structure S, a product of Gamma distributions is a conjugate prior for Θ, and that under this prior, the marginal likelihood of the data can be given in closed form. Thus, parameter estimation can be done in closed form given a structure, and imposing a structural prior allows a closed form Bayesian score to be computed for a structure.

The structure of a PCIM can be represented by a set of decision trees [8]. In particular, each state function σ_l can be represented by a decision tree whose leaves represent the states s ∈ Σ_l, as shown in the example of Figure 1. The decision nodes in each tree contain functions that map a time t and a history y to one of its child nodes. These functions are piecewise constant in time, so that the state function σ_l(t, y) represented by a decision tree is also piecewise constant. Structure learning for each label l is performed by starting with the trivial tree, and iteratively refining it greedily based on the closed form Bayesian score mentioned above. A detailed presentation of this learning procedure, as generalized to C-PCIMs, is given in Section 4.1.

[Figure 1: two decision trees, one per label, with history tests such as "A in [t−1, t)" at the internal nodes and intensities λ_A, λ_B ∈ {0.0, 0.1, 10.0} at the leaves.]
Figure 1: Example decision trees representing a PCIM for a problem with L = {A, B}.
3.2   Conjoint PCIMs

Conjoint PCIMs (C-PCIMs), like PCIMs, are marked point processes where the conditional intensity functions λ_l(t|y) are piecewise constant and take on a finite number of values, but unlike in PCIMs, the conditional intensity functions in C-PCIMs take on values from a single set of values shared across all labels l ∈ L. Thus λ_l(t|y) takes on values {λ_s} for s ∈ Σ, which are shared across all l. Which of these values is taken on by λ_l(t|y) is specified by a C-PCIM state function σ(l, t, y) which, unlike PCIM state functions, is also a function of the label l whose conditional intensity function is being evaluated. Thus, λ_l(t|y) = λ_{σ(l,t,y)}. C-PCIMs therefore allow an intensity value λ_s to be shared across conditional intensity functions for different labels, possibly at different times and for different histories. Thus, a C-PCIM is defined by a structure S = (Σ, σ(·, ·, ·)) consisting of a state set Σ and a state function σ(·, ·, ·), as well as a parameter vector Θ = {λ_s}_{s∈Σ}, all of which are shared across labels l ∈ L.

We use a decision tree representation of the structure S of a C-PCIM. However, instead of using a different decision tree for each label l as in Gunawardana et al. [8], we use a single tree that is used across all labels, as shown in the example of Figure 2. The leaves of the tree represent states s ∈ Σ. However, the decision nodes of a C-PCIM tree contain functions that can depend on l as well as on t and y. In particular, each decision node contains a basis state function f chosen from a given set B. Each basis state function f(l, t, y) is piecewise constant in t for each l and y, and takes values from a finite basis state set Σ_f. Thus, a decision node with basis state function f has a child node corresponding to each s′ ∈ Σ_f. Since the basis state functions are defined to be valid piecewise constant state functions, the mapping from (l, t, y) to leaves given by the tree is also a valid piecewise constant state function σ(l, t, y).
[Figure 2: a single decision tree whose root tests "l in [t−1, t)" and whose other internal nodes test the label (l = A) and the history, with shared intensities λ ∈ {0.0, 0.1, 10.0} at the leaves.]
Figure 2: Decision tree representing a C-PCIM equivalent to the example PCIM of Figure 1.
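To make the contrast between the two representations concrete, the following is a hedged Python sketch (ours, not the authors' code) of one plausible reading of the trees in Figures 1 and 2: the PCIM keeps a separate state function and rate table per label, while the equivalent C-PCIM uses a single state function over (l, t, y) whose "repeat" leaves are shared by both labels. The state names and the exact branch assignments are illustrative.

```python
from typing import List, Tuple

Event = Tuple[float, str]  # (time, label)

def occurred(label: str, t: float, y: List[Event], d1: float, d2: float) -> bool:
    """True if an event with the given label occurred in the window [t - d2, t - d1)."""
    return any(lab == label and t - d2 <= ti < t - d1 for ti, lab in y)

# PCIM: one decision tree (state function) per label, each with its own rate table.
def sigma_A(t, y):
    if occurred("A", t, y, 0.0, 1.0):
        return "A_repeat_again" if occurred("A", t, y, 1.0, 2.0) else "A_repeat"
    return "A_quiet"

def sigma_B(t, y):
    if occurred("B", t, y, 0.0, 1.0):
        return "B_repeat_again" if occurred("B", t, y, 1.0, 2.0) else "B_repeat"
    return "B_after_A" if occurred("A", t, y, 0.0, 5.0) else "B_quiet"

pcim_rates = {"A_quiet": 0.1, "A_repeat": 10.0, "A_repeat_again": 0.0,
              "B_quiet": 0.0, "B_after_A": 0.1, "B_repeat": 10.0, "B_repeat_again": 0.0}

# C-PCIM: a single state function over (l, t, y); the "repeat" states and their
# intensities are shared across labels, so the model has fewer free parameters.
def sigma_conjoint(l, t, y):
    if occurred(l, t, y, 0.0, 1.0):
        return "repeat_again" if occurred(l, t, y, 1.0, 2.0) else "repeat"
    if l == "A":
        return "A_quiet"
    return "B_after_A" if occurred("A", t, y, 0.0, 5.0) else "B_quiet"

cpcim_rates = {"repeat": 10.0, "repeat_again": 0.0,
               "A_quiet": 0.1, "B_after_A": 0.1, "B_quiet": 0.0}

# Both parameterizations assign the same intensity lambda_l(t|y) in this toy example.
y = [(0.2, "A"), (0.9, "B")]
assert cpcim_rates[sigma_conjoint("A", 1.0, y)] == pcim_rates[sigma_A(1.0, y)]
assert cpcim_rates[sigma_conjoint("B", 1.0, y)] == pcim_rates[sigma_B(1.0, y)]
```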
4   Learning Conjoint PCIMs

In this section, we directly generalize the parameter and structure learning approaches for PCIMs [8] to apply to C-PCIMs. For C-PCIMs, the likelihood of equation (1) can be written as

    p(x|S, Θ) = ∏_{s∈Σ} λ_s^{c_s(x)} e^{−λ_s d_s(x)}    (2)

where d_s(x) and c_s(x) are sufficient statistics of the data. d_s(x) is the total duration, summed over labels l, during which σ(l, t, h(t, x)) = s, and c_s(x) is the number of times a label l occurs in x when the state function maps to state s for that label. Formally,

    d_s(x) = ∑_{l∈L} ∫_0^{t(x)} 1_s(σ(l, τ, h(τ, x))) dτ
    c_s(x) = ∑_{l∈L} ∑_i 1_l(l_i) 1_s(σ(l, t_i, h_i)).

Note that since σ(l, τ, h) in the integral above is piecewise constant in τ, the integral reduces to a sum over the constant pieces of σ(l, τ, h).
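As a sketch of how d_s(x) and c_s(x) can be accumulated (our illustration, not the paper's implementation), one can sweep the segments between consecutive event times of x over an observation window [0, T], assuming a generic state function sigma(l, t, history). A full implementation would additionally split segments wherever a basis state function changes value between events.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Tuple[float, str]  # (time, label)

def sufficient_stats(x: List[Event], labels: List[str],
                     sigma: Callable[[str, float, List[Event]], str],
                     T: float) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Accumulate d_s(x) (duration in state s, summed over labels) and c_s(x)
    (event counts attributed to state s), treating sigma(l, ., h) as constant on
    each inter-event segment (a simplification; see the lead-in)."""
    d: Dict[str, float] = defaultdict(float)
    c: Dict[str, int] = defaultdict(int)
    times = [0.0] + [t for t, _ in x] + [T]
    for t0, t1 in zip(times, times[1:]):
        tm = 0.5 * (t0 + t1)                    # representative point inside the segment
        h = [e for e in x if e[0] <= t0]        # history at the start of the segment
        for l in labels:
            d[sigma(l, tm, h)] += t1 - t0       # contribution to d_s
    for ti, li in x:
        hi = [e for e in x if e[0] < ti]        # history h_i just before the i-th event
        c[sigma(li, ti, hi)] += 1               # contribution to c_s
    return dict(d), dict(c)
```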
A product of Gamma priors on λ_s is conjugate for Θ. The corresponding prior and posterior densities for λ_s are given by

    p(λ_s|α, β) = (β^α / Γ(α)) λ_s^{α−1} e^{−βλ_s}
    p(λ_s|α, β, x) = p(λ_s|α + c_s(x), β + d_s(x)).

In our experiments, we obtain the point estimate Θ̂ = E[Θ|S, x] from the training data, given by

    λ̂_s = (α + c_s(x)) / (β + d_s(x)).
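This conjugate update is a one-liner; a minimal sketch (ours) of the posterior update and point estimate, reusing the sufficient statistics c_s(x) and d_s(x) from above:

```python
def posterior_params(alpha: float, beta: float, c_s: float, d_s: float):
    """Gamma(alpha, beta) prior on lambda_s combined with c_s events over duration d_s
    gives a Gamma(alpha + c_s, beta + d_s) posterior."""
    return alpha + c_s, beta + d_s

def point_estimate(alpha: float, beta: float, c_s: float, d_s: float) -> float:
    """lambda_hat_s = E[lambda_s | S, x] = (alpha + c_s(x)) / (beta + d_s(x))."""
    a_post, b_post = posterior_params(alpha, beta, c_s, d_s)
    return a_post / b_post

# Example: alpha = 0.01, beta = 1 day, and 3 events over 10 days in state s
# give lambda_hat_s = 3.01 / 11 ≈ 0.27 events per day.
```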
4.1   Structure Learning

For structure learning, we can write the marginal likelihood of the data x given the structure S in closed form as

    p(x|S) = ∏_{s∈Σ} (β^α / Γ(α)) · Γ(α + c_s(x)) / (β + d_s(x))^{α + c_s(x)},

where we denote the factor contributed by state s as γ_s(x).
As with PCIMs, we use a Bayesian decision tree building procedure [3] in order to learn the structure S. We begin with the trivial structure Σ = {s_0}, σ(l, t, y) = s_0 with only the root node s_0, and refine the structure S by iteratively splitting leaves s ∈ Σ based on basis state functions f ∈ B. In particular, given a current structure S = (Σ, σ), a new structure S′ = (Σ′, σ′) is produced by selecting a state s ∈ Σ and a basis state function f ∈ B and refining s based on f as follows:

    Σ′ = ( ⋃_{s′∈Σ_f} {s ◦ s′} ) ∪ (Σ \ {s})
    σ′(l, t, y) = s ◦ f(l, t, y)   if σ(l, t, y) = s
    σ′(l, t, y) = σ(l, t, y)       otherwise,

where ◦ is the concatenation operator. Thus, S′ can be represented as a tree where leaf s of the tree representing S has been split according to the result of f.

In order to select the state s and basis state function f to use in producing a refined structure S′ from S, we define a factored prior

    p(S) ∝ κ^{|Σ|}

on the structure S. The posterior probability of a structure S given the data x is then proportional to p(S)p(x|S) = ∏_{s∈Σ} κ γ_s(x), which can be computed in closed form. Thus, the gain in p(S|x) due to splitting state s using basis state function f is

    Gain(S → S′) = p(S′|x) / p(S|x)
                 = ( ∏_{s′∈Σ′} κ γ_{s′}(x) ) / ( ∏_{s∈Σ} κ γ_s(x) )
                 = ( ∏_{s′∈Σ_f} κ γ_{s ◦ s′}(x) ) / ( κ γ_s(x) ).

We refine the structure greedily, choosing the refinement with the highest gain, until no further gain results.
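The greedy search itself is then a loop over candidate (state, basis state function) splits scored by this gain. A schematic sketch (ours), reusing log_gamma_s from the previous sketch; how the candidate splits and their sufficient statistics are enumerated is left abstract:

```python
import math

def log_gain(alpha: float, beta: float, kappa: float, parent, children) -> float:
    """log Gain(S -> S'): parent is the (c_s, d_s) pair of the leaf being split, and
    children is a list of (c, d) pairs, one per outcome s' in Sigma_f."""
    return (sum(math.log(kappa) + log_gamma_s(alpha, beta, c, d) for c, d in children)
            - (math.log(kappa) + log_gamma_s(alpha, beta, *parent)))

def best_split(candidates, alpha: float, beta: float, kappa: float):
    """candidates maps (leaf_state, basis_fn_id) -> (parent_stats, child_stats_list).
    Returns the candidate with the largest positive gain, or None if no split helps;
    a full learner recomputes the candidates and repeats until this returns None."""
    best_key, best_g = None, 0.0
    for key, (parent, children) in candidates.items():
        g = log_gain(alpha, beta, kappa, parent, children)
        if g > best_g:
            best_key, best_g = key, g
    return best_key
```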
5   Basis State Functions for C-PCIMs

The modeling power of a family of C-PCIMs is determined by the basis B of state functions selected. The basis needs to capture the aspects of the history that determine the intensities of events, and needs to distinguish labels with different intensities. In addition, the basis needs to allow the sharing of parameters between labels to allow for generalization of event behavior between labels. This is particularly important in problems where some labels occur rarely. We will give basis state functions that take advantage of known structure in the label sets in order to do this. We first describe how we capture label space structure through label attributes, and then give an ontology of basis state functions that make use of this structure.

5.1   Structured Labels

When the labels have a known structure, we will take advantage of it in order to define models that can learn dependencies between events with labels that may be rare, or may not even occur in the training data. For example, in data center system event logs, events may be labeled with the machine on which the event took place and the type of the event. An event of type disk-sector-error may occur on machine 9,045. While we may have never observed a disk sector error on a different machine 6,732, we may wish to allow the structure learning procedure to determine whether the behavior of disk-sector-error events generalizes across machines. Thus, we would like the basis state functions to be able to query for the message represented by a label independently of the machine.

In general, we assume that the label set L has a set of attributes A, where each attribute a ∈ A can take values in a set V_a. Label l takes on value v_a(l) of attribute a. In the example above, A = {machine-id, event-type}, V_machine-id ranges over all the machines in the data center, and V_event-type ranges over all possible events. If prior information about the labels is available, it may be encoded through label attributes. For example, if we know a priori that machines in the data center are grouped into database servers and web servers, we could introduce an attribute server-type that takes on values V_server-type = {database, web}. On the other hand, we can access label identity as an attribute by using the attribute identity with V_identity = L and v_identity(l) = l. In the descriptions below, we will therefore always assume that there are label attributes defined. In cases where no structural information is available we will simply use A = {identity}. In particular, the basis state functions for PCIMs [8] do not explicitly use label attributes, but can be described as using this trivial identity attribute.

5.2   Types of Basis State Functions

Allowing large classes of basis state functions that depend arbitrarily on the history and time as well as the label is computationally difficult. We therefore restrict attention to three specific classes of basis state functions in this paper.

History Basis State Functions: A history basis state function f(l, t, y) depends only on the history y and the time t, and not on the label l. In this paper, we concentrate on a particular class of history basis state functions f_{a,v,d1,d2}(l, t, y), indexed by an attribute a ∈ A, a value v ∈ V_a and time offsets d2 > d1 ≥ 0, and given by

    f_{a,v,d1,d2}(l, t, y) = 1 if ∃(t′, l′) ∈ y : t′ ∈ [t − d2, t − d1) ∧ v_a(l′) = v, and 0 otherwise.

That is, f_{a,v,d1,d2}(l, t, y) tests whether the history y contains an event in the time window between d1 and d2 before t whose label has the value v of attribute a.

Example. In modeling web search query logs, the history basis state function f_{query-category,Health,1 hr,1 day}(l, t, y) tests whether y contains a query whose query-category attribute is Health between 1 hour and 1 day before t.

Label Basis State Functions: A label basis state function f(l, t, y) depends only on the label l and not on the time t or the history y. A label basis state function f_{a,v}(l, t, y) is indexed by an attribute a ∈ A and a value v ∈ V_a, and is given by

    f_{a,v}(l, t, y) = 1 if v_a(l) = v, and 0 otherwise.

That is, f_{a,v}(l, t, y) simply tests whether the attribute a of label l has value v.

Example. The label basis state function f_{query-category,Health}(l, t, y) tests whether l has query-category attribute Health.

Match Basis State Functions: A match basis state function f_{a,d1,d2}(l, t, y) is given by

    f_{a,d1,d2}(l, t, y) = 1 if ∃(t′, l′) ∈ y : t′ ∈ [t − d2, t − d1) ∧ v_a(l′) = v_a(l), and 0 otherwise.
In other words, the match basis state function tests whether the history y contains an event in the time window between d1 and d2 before t whose label matches l in attribute a. This kind of basis state function is useful in modeling repetitions of a given attribute in an event stream.

Example. The match basis state function f_{query-category,1 hr,1 day}(l, t, y) tests whether y contains a query with the same query-category attribute as l between 1 hour and 1 day before t.
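All three classes are easy to implement once labels carry their attribute values. The following is a minimal Python sketch (ours), in which a label is represented as a dictionary from attribute names to values and each builder returns a basis state function f(l, t, y) with values in Σ_f = {0, 1}:

```python
from typing import Dict, List, Tuple

Label = Dict[str, str]        # attribute name -> value, e.g. {"query-category": "Health"}
Event = Tuple[float, Label]   # (time, label)

def history_basis(a: str, v: str, d1: float, d2: float):
    """f_{a,v,d1,d2}: is there an event in [t-d2, t-d1) whose attribute a equals v?"""
    def f(l: Label, t: float, y: List[Event]) -> int:
        return int(any(t - d2 <= ti < t - d1 and li.get(a) == v for ti, li in y))
    return f

def label_basis(a: str, v: str):
    """f_{a,v}: does the label l being evaluated have value v for attribute a?"""
    def f(l: Label, t: float, y: List[Event]) -> int:
        return int(l.get(a) == v)
    return f

def match_basis(a: str, d1: float, d2: float):
    """f_{a,d1,d2}: is there an event in [t-d2, t-d1) matching l on attribute a?"""
    def f(l: Label, t: float, y: List[Event]) -> int:
        return int(any(t - d2 <= ti < t - d1 and li.get(a) == l.get(a) for ti, li in y))
    return f

# Example from the text: a Health query between 1 hour and 1 day before t (times in hours).
f_health = history_basis("query-category", "Health", 1.0, 24.0)
```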
We note that history basis state functions are equivalent to the history basis functions of PCIMs [8] when the attribute a is restricted to be the identity attribute described above. Thus, a PCIM can be represented as a C-PCIM that uses only the identity attribute, and whose state function uses a tree that first splits based on every possible label basis state function, and then uses only history basis state functions. We use the term Attribute PCIM (A-PCIM) to refer to the slight generalization of PCIMs that allows the history basis state functions to use arbitrary attributes.


6   Experimental Results

In this section we evaluate the value of C-PCIMs on two real world tasks with large structured label spaces. The first is to model the query behavior of web search users, and the second is to model the behavior of a cluster of machines in a commercial data center.

In order to evaluate the value of C-PCIMs in modeling event streams with large label spaces, we compare C-PCIMs to PCIMs. To explore the gains due to conjoint modeling as opposed to the use of richer label attributes in modeling, we also compare to A-PCIMs. It has been shown that PCIMs have better computational and predictive performance than Poisson networks [15], which model conditional intensity functions using generalized linear models. We therefore do not compare to Poisson networks and other closely related approaches that use regression techniques to model the conditional intensity functions [23, 24]. CTBNs with “toggle” variables modeling events and latent variables modeling dynamics are a natural baseline for continuous time event modeling. We explored building such models using the CTBN-RLE toolkit [20], but the current version does not scale to these data sets [19].

6.1   Web Search Query Behavior

Queries issued by the users of a major commercial search engine were used to generate a training set consisting of approximately 100,000 queries collected from approximately 6,000 users over a two month period, and a test set consisting of approximately 170,000 queries from approximately 6,000 users over the next month. There was no user or time overlap between the training and test sets. The test set was restricted to contain only users whose query history was available for at least two weeks. The queries were automatically mapped to a two level hierarchy of categories such as Health & Wellness/Mental Health. Thus, our label set consisted of the 476 categories in the hierarchy, while the 37 coarse level categories were used as a second (i.e. non-identity) attribute, which we name coarse. All labels occurred in the training set.

We use a product of terms of the form of equation (1), one per user, to model the data, choosing not to model inter-user dependencies. We investigated three model classes for modeling users' query behavior. First, we built PCIMs which used only history basis state functions with the identity attribute (i.e. the history basis functions only tested for the occurrence of specific labels in the history). Second, we built A-PCIMs that were also allowed to use the coarse attribute in the history basis functions. Third, we built C-PCIMs that additionally used label basis state functions with the identity and coarse attributes, and match basis state functions with the identity attribute. Match basis state functions using the coarse attribute were not allowed, in order to reduce the computation time. All models used a prior with α = 1/365, β = 1 day, and κ = 0.001. The history and match basis state functions used the time windows [t − 1 hr, t), [t − 1 day, t − 1 hr), [t − 7 days, t − 1 day), and [0, t − 7 days). All models took less than 12 hours to train on a single 3 GHz workstation.
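For reference, the three configurations compared here can be summarized as a small configuration sketch (our paraphrase of the setup above, not released code); the builder names refer to the basis state function sketch of Section 5.

```python
# Prior and structure-search hyperparameters reported above.
PRIOR = {"alpha": 1 / 365, "beta_days": 1.0, "kappa": 1e-3}

# Time-window offsets (d1, d2) in days for the window [t - d2, t - d1).
WINDOWS = [(0.0, 1 / 24), (1 / 24, 1.0), (1.0, 7.0), (7.0, float("inf"))]

# Label attributes each model class may use in each basis state function class.
MODEL_BASES = {
    "PCIM":   {"history": ["identity"],           "label": [],                     "match": []},
    "A-PCIM": {"history": ["identity", "coarse"], "label": [],                     "match": []},
    "C-PCIM": {"history": ["identity", "coarse"], "label": ["identity", "coarse"], "match": ["identity"]},
}
```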
[Figure 3: plot of test set log likelihood (×10^5) versus training set size (1%, 10%, 100%) for the three models.]
Figure 3: Test set log likelihood of PCIM, A-PCIM, and C-PCIM for the search query data, as a function of training set size.

Figure 3 shows the test set log likelihood of all three models as the amount of training data is varied.
The log likelihoods of the PCIM and A-PCIM are nearly indistinguishable, while the C-PCIM performs much better, especially with smaller amounts of training data. This is the expected behavior, since C-PCIMs are better able to share parameters. Note that since C-PCIMs, A-PCIMs, and PCIMs define the same family of marked point processes, we expect them to approach the same predictive performance as the training set grows. While the gap between C-PCIMs and A-PCIMs/PCIMs does shrink as the amount of training data grows, C-PCIMs have higher test set likelihood even when all the training data is used. We examined the likelihoods assigned to each test event by the C-PCIM and the A-PCIM trained with the full training set, in order to determine the statistical significance of this gap. We grouped the per-event likelihoods by label, and found that the C-PCIM significantly outperforms the A-PCIM (p = 0.01) on 391 out of the 436 labels observed in the test set, while under-performing the A-PCIM on none, according to a paired sign test.

To understand the practical impact of the likelihood gains, we used importance sampling [8] to forecast whether each test user would issue Health & Wellness/Mental Health queries on the eighth day in the test set given their behavior in the first week. Precision-recall curves for the three models are given in Figure 4. Although there are only eight test users who issued these queries on that day, C-PCIMs make much better predictions than A-PCIMs and PCIMs. Such predictions are useful in applications such as targeted display (banner) advertising, where an advertiser may only wish their advertisements to be shown to web users who are interested in a topic related to their advertisement. The assumption is that users who will issue a query in a particular category during the day may be more receptive to advertisements related to that category during that day.

[Figure 4: precision versus recall curves for the three models.]
Figure 4: Precision-recall curves of PCIM, A-PCIM, and C-PCIM for predicting Health & Wellness/Mental Health queries.

6.2   Data Center Machine Behavior

System logs from a cluster of machines in a commercial data center were used to generate a data set of about 300,000 logged system events from 71 machines, over the period of a month. There were 221 possible messages. This gave 15,691 possible labels, each of which was a machine-message combination. Each machine belonged to one of four machine types that was known a priori. Thus, each label had four attributes {machine, message, machine-type, identity}. The first two weeks of data was used for training, and the rest for testing.

We use a product of terms of the form of equation (1), one per machine, to model the data, choosing to not allow inter-machine dependencies. We experimented with using the power of C-PCIMs to allow machine specific dependencies. In particular, we compare a PCIM and a C-PCIM that treat machines identically with a PCIM and a C-PCIM that allow machine specific dependencies. Non-identical modeling of machines may be useful if, for example, a certain machine is old and more prone to failures. Models that treat machines identically are allowed to use only the message attributes of labels. The PCIM that treats machines identically is forced to use the same conditional intensity function for all labels that agree on their message attributes, pooling data from these labels during training. Note that in this setting, PCIMs and A-PCIMs are equivalent in both the identical machine and non-identical machine cases.

All models used the prior α = 0.01, β = 1.0 day, and κ = 0.01. The C-PCIM trees were truncated at a depth of 30 to save computation. The history and match basis state functions used the time windows [t − 20 min, t) and [t − 1 hr, t − 20 min). The models took less than 2 hours to train, except the non-identical machine PCIM, which learned a separate tree for each of the 5,307 labels present in the training data. This took less than 12 hours.

Figure 5 shows the log likelihood of events after the first two weeks given the events of the first two weeks for all four models. Note that the log likelihoods can be positive since the likelihoods are density values and not probabilities. It can be seen that building separate PCIMs for each label, without pooling data across machines for each message, results in a large loss in test set likelihood, due to data sparsity. Both C-PCIMs outperform the PCIMs. Note that the identical machine C-PCIM only has access to message information, and therefore does not have access to any label structure.
[Figure 5: bar plot of test set log likelihood (×10^6) for PCIM (ident), PCIM (non-ident), C-PCIM (ident), and C-PCIM (non-ident).]
Figure 5: Test set log likelihood of PCIMs and C-PCIMs that treat machines as identical, and PCIMs and C-PCIMs that do not, for the data center event log data.

However, it gives a large gain over PCIMs due to the use of match basis state functions, which model repeated events. The ability of the non-identical machine C-PCIM to leverage label structure to learn dependencies specific to machines or machine types leads to a further gain in likelihood. Unfortunately, events of interest such as machine failures are very rare in this data set, so that there are not enough test cases to obtain statistically significant comparisons between C-PCIMs and PCIMs.


7   Conclusions

We have introduced Conjoint PCIMs, and shown how they improve upon PCIMs in modeling the dynamics of two real world event streams with large structured label sets. We have shown how conjoint modeling across labels allows better models to be built from sparse data, and how C-PCIMs can leverage known structure in the label space. We have also shown that the predictive gains of C-PCIMs are not achieved by Attribute PCIMs that leverage label structure but do not use conjoint modeling.

While it would be of interest to compare the performance of C-PCIMs with other approaches such as CTBNs, limitations in the currently available CTBN implementation prevented us from doing so. It would also be interesting to investigate approaches that extend CTBNs to allow them to take advantage of label structure in the manner of C-PCIMs. Another future direction of interest is to investigate other extensions of PCIMs that allow parameter sharing across labels, such as hierarchical Bayesian approaches or regularization approaches.


References

 [1] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat., 37(6):1554–1563, December 1966.

 [2] Emery N. Brown, Robert E. Kass, and Partha P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuro., 7(5), 2004.

 [3] David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In UAI, 1997.

 [4] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Elementary Theory and Methods, volume I. Springer, 2nd edition, 2003.

 [5] Thomas Dean and Keiji Kanazawa. Probabilistic temporal reasoning. In AAAI, 1988.

 [6] Vanessa Didelez. Graphical models for marked point processes based on local independence. J. Roy. Stat. Soc., Ser. B, 70(1):245–264, 2008.

 [7] N. Friedman, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. J. Comp. Bio., 7:601–620, 2000.

 [8] Asela Gunawardana, Christopher Meek, and Puyang Xu. A model for temporal dependencies in event streams. In NIPS, 2011.

 [9] Mei-Yuh Hwang and Xuedong Huang. Shared-distribution hidden Markov models for speech recognition. IEEE Trans. Spch. & Aud. Proc., 1(4):414–420, October 1993.

[10] Mei-Yuh Hwang, Xuedong Huang, and Fileno A. Alleva. Predicting unseen triphones with senones. IEEE Trans. Spch. & Aud. Proc., 4(6):412–419, November 1996.

[11] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust., Spch., & Sig. Proc., ASSP-35(3):400–401, March 1987.

[12] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Learning continuous time Bayesian networks. In UAI, 2003.

[13] Adam Oliner and Jon Stearley. What supercomputers say: an analysis of five system logs. In IEEE/IFIP Conf. Dep. Sys. Net., 2007.
[14] Lawrence R. Rabiner.       A tutorial on hid-
     den Markov models and selected applications in
     speech recognition. Proc. IEEE, 77(2):257–286,
     February 1989.

[15] Shyamsundar Rajaram, Thore Graepel, and Ralf
     Herbrich. Poisson-networks: A model for struc-
     tured point processes. In AIStats, 2005.

[16] Vinayak Rao and Yee Whye Teh. Fast MCMC
     sampling for Markov jump processes and contin-
     uous time Bayesian networks. In UAI, 2011.

[17] Vinayak Rao and Yee Whye Teh. Gaussian pro-
     cess modulated renewal processes. In NIPS, 2011.

[18] Ardavan Saeedi and Alexandre Bouchard-Côté.
     Priors over recurrent continuous time processes.
     In NIPS, 2011.

[19] Christian R. Shelton. Personal communication,
     2012.

[20] Christian R. Shelton, Yu Fan, William Lam, Joon
     Lee, and Jing Xu. Continuous time Bayesian net-
     work reasoning and learning engine. JMLR, 11,
     2010.

[21] Aleksandr Simma, Moises Goldszmidt, John Mac-
     Cormick, Paul Barham, Richard Brock, Rebecca
     Isaacs, and Richard Mortier. CT-NOR: Repre-
     senting and reasoning about events in continuous
     time. In UAI, 2008.

[22] Aleksandr Simma and Michael I. Jordan. Model-
     ing events with cascades of Poisson processes. In
     UAI, 2010.

[23] Wilson Truccolo, Uri T. Eden, Matthew R. Fel-
     lows, John P. Donoghue, and Emery N. Brown.
     A point process framework relating neural spik-
     ing activity to spiking history, neural ensemble,
     and extrinsic covariate effects. J. Neurophysiol.,
     93:1074–1089, 2005.

[24] Duy Q. Vu, Arthur U. Asuncion, David R.
     Hunter, and Padhraic Smyth. Continuous-time
     regression models for longitudinal networks. In
     NIPS, 2011.

[25] M. Yuan and Y. Lin. Model selection and estima-
     tion in regression with grouped variables. J. Roy.
     Stat. Soc., Ser. B, 68(1):49–67, 2006.