=Paper=
{{Paper
|id=None
|storemode=property
|title=Conjoint Modeling of Temporal Dependencies in Event Streams
|pdfUrl=https://ceur-ws.org/Vol-962/paper08.pdf
|volume=Vol-962
|dblpUrl=https://dblp.org/rec/conf/uai/ParikhGM12
}}
==Conjoint Modeling of Temporal Dependencies in Event Streams==
Ankur P. Parikh∗† (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; apparikh@cs.cmu.edu), Asela Gunawardana∗ (Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA; aselag@microsoft.com), Christopher Meek (Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA; meek@microsoft.com)

∗ These authors contributed equally to the work. † This research was conducted while APP was a research intern at Microsoft Research, Redmond.

===Abstract===

Many real world applications depend on modeling the temporal dynamics of streams of diverse events, many of which are rare. We introduce a novel model class, Conjoint Piecewise-Constant Conditional Intensity Models, and a learning algorithm that together yield a data-driven approach to parameter sharing with the aim of better modeling such event streams. We empirically demonstrate that our approach yields more accurate models of two real world data sets: search query logs and data center system logs.

===1 Introduction===

Event streams, temporal sequences of discrete events, are ubiquitous in many domains such as the firing patterns of neurons [2], gene expression data [7], system error logs [13], and web search engine query logs. Learning a model for the temporal dependencies between events can be useful for understanding and exploiting the dynamics in such domains. For example, learning that particular system events on a machine predict failures at a later time may allow a system administrator to prioritize preventive maintenance. Understanding how web search queries for commercially valuable terms are dependent on prior queries, possibly for other topics, can help in targeted advertising.

In many cases the data exhibits complex temporal dependencies. For example, what a user will query for at a particular time can depend on queries issued in the last few minutes, the previous day, as well as in the extended past. In a data center, the likelihood of a machine failing may depend on cascades of various prior warnings, errors, and failures.

In many domains, it is valuable to model fine distinctions between event types. For example, in targeted advertising, it is valuable to distinguish whether a user will issue queries related to mental health or to back pain rather than simply predicting that a user will issue a healthcare query. In a data center, it is more useful to learn that particular disk errors predict failed reboots than to know that generic error messages predict generic failures. While useful to model, such fine grained event types are rarer than the coarser grained ones that include them, and are therefore more difficult to model. Many models of temporal dependencies in event streams, such as the Piecewise-Constant Conditional Intensity Model (PCIM) [8], learn the dependencies of each event type separately, using independent sub-models for each event type. Thus, they are not able to model rare events well.

In this paper, we address this problem using parameter sharing, by introducing Conjoint Piecewise-Constant Conditional Intensity Models (C-PCIMs) and a learning algorithm for C-PCIMs, which yield a novel data-driven approach to modeling fine grained event streams. C-PCIMs generalize PCIMs by allowing parameters to be shared across event types, and our learning algorithm uses the data to determine which parameters should be shared. In particular, we give a conjugate prior that allows parameter learning for the C-PCIM to be performed as efficiently as for the PCIM, and that leads to a closed-form marginal likelihood, allowing efficient structure learning. During structure learning, the C-PCIM learns what event types in what historical contexts can be modeled by shared parameters, thereby allowing more efficient use of data during parameter estimation. In cases where events are structured, with the different event types having known attributes, we show how structure learning can take advantage of these attributes to distinguish between different event types when their dependencies differ, while sharing parameters when they do not. Finally, we give empirical evidence that demonstrates the value of C-PCIMs in two real applications which are not well addressed by existing approaches: modeling the temporal query dynamics of web search users and modeling the temporal dynamics of system events in a data center. In the second application, we demonstrate that the expressive power of C-PCIMs combined with the data driven learning approach allows us to relax the strong assumption of identical machines, yielding further accuracy improvements.
===2 Related Work===

Event streams can be modeled in either discrete or continuous time. Discrete time approaches such as Hidden Markov Models (HMMs) [1, 14] and Dynamic Bayesian Networks (DBNs) [5] require event times to be discretized, which requires a choice of sampling rate, and with it, trade-offs involving fidelity of representation, time-span of dependencies, and computational cost. We avoid this choice, and model event streams in continuous time. There have been a number of recent approaches for modeling continuous-time processes. C-PCIMs, like PCIMs [8], model event streams as marked point processes, where events have both an arrival time and a label specifying the type of event, via conditional intensity functions. A number of other closely related approaches [15, 23, 24] use regression techniques such as generalized linear models, Cox regression and Aalen regression to model conditional intensity functions. Continuous Time Noisy-Or [21] and Poisson cascades [22] are also approaches for modeling event streams. These approaches do not address model selection, and require a parametric form for temporal dependencies to be specified. Predictive performance is strongly impacted by this modeling choice, which is domain dependent [21, 22]. There has also been some recent work on nonparametric Bayesian approaches for modeling unlabeled event streams [17].

Continuous Time Bayesian Networks (CTBNs) [12] and Markov Jump Processes [16] are Markov process models of the trajectories of discrete variables over continuous time. In contrast to PCIMs and C-PCIMs, they are Markov process models. A CTBN can be used to model an event stream by modeling each kind of event as a transition of a "toggle" variable [20], and using latent state variables to model their dynamics, to give a continuous time analog of an HMM.

Conjoint PCIMs differ from PCIMs in the way that parameters are shared. Such approaches have been used in other problems such as for building hidden Markov models with large state spaces [9, 10], and for building n-gram language models [11]. Hierarchical Gamma-Exponential processes [18] are a hierarchical nonparametric Bayesian approach to conjoint modeling in Markov processes such as CTBNs. Regularization approaches may also be used for conjoint modeling [25], although we are unaware of applications to continuous time event modeling.

===3 The Model===

We represent an event sequence as y = {(t_i, l_i)}_{i=1}^{n} with 0 < t_1 < ... < t_n, where t_i ∈ [0, ∞) is the time of the i-th event and l_i is its label, drawn from a finite label set L. The history at time t of event sequence y is the sub-sequence h(t, y) = {(t_i, l_i) | (t_i, l_i) ∈ y, t_i ≤ t}. We write h_i for h(t_{i−1}, y) when it is clear from context which y is meant. By convention t_0 = 0. We define the ending time t(y) of an event sequence y as the time of the last event in y: t(y) = max({t : (t, l) ∈ y}), so that t(h_i) = t_{i−1}.

The data x, which is a particular event sequence, is modeled as a realization of a regular marked point process [4, 6] with likelihood

: <math>p(x \mid \theta) = \prod_{l \in L} \prod_{i=1}^{n} \lambda_l(t_i \mid h_i; \theta)^{1_l(l_i)}\, e^{-\Lambda_l(t_i \mid h_i; \theta)} \qquad (1)</math>

where λ_l(t|h; θ) is the conditional intensity function [4] for label l, and <math>\Lambda_l(t \mid h; \theta) = \int_{t(h)}^{t} \lambda_l(\tau \mid h; \theta)\, d\tau</math>. We write 1_Z(·) for the indicator function of a set Z and 1_z(·) for the indicator of the singleton {z}. Intuitively, λ_l(t|h; θ) is the expected rate of events with label l at time t given the history h. Note that despite the similarity to the likelihood of a non-homogeneous Poisson process, this likelihood does not in general define a Poisson process, as the conditioning on history can cause the independent increments property of Poisson processes to not hold. The conditioning on the entire history also means that such processes are non-Markovian. Piecewise Constant Conditional Intensity Models (PCIMs) [8] are a particular class of marked point process where the conditional intensity functions are restricted to be piecewise constant. In this paper, we introduce Conjoint PCIMs, which are PCIMs that use a conjoint representation for the conditional intensity functions. These models are described below.
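To make the likelihood concrete, the following sketch evaluates the log of equation (1) for a single event sequence. It is an illustrative implementation, not code from the paper: the function and variable names are ours, and it makes the simplifying assumption that each conditional intensity is constant between successive events, so that each compensator term Λ_l(t_i|h_i) reduces to a rate times an inter-event duration.

<syntaxhighlight lang="python">
import math

def log_likelihood(events, labels, intensity):
    """Log of equation (1) for one event sequence.

    events    : [(t_1, l_1), ..., (t_n, l_n)] with 0 < t_1 < ... < t_n
    labels    : the finite label set L
    intensity : intensity(l, t, history) -> lambda_l(t | h); assumed constant
                on each inter-event interval (a simplification for this sketch)
    """
    ll = 0.0
    prev_t = 0.0
    history = []
    for (t_i, l_i) in events:
        dt = t_i - prev_t
        for l in labels:
            lam = intensity(l, prev_t, history)
            ll -= lam * dt                # -Lambda_l(t_i | h_i): survival term for every label
            if l == l_i:
                ll += math.log(lam)       # log lambda_l(t_i | h_i) for the label that fired
        history.append((t_i, l_i))
        prev_t = t_i
    return ll
</syntaxhighlight>

In a full PCIM or C-PCIM implementation the intensities can also change within an inter-event interval (for example when a time-window test expires), so the survival term is accumulated piece by piece rather than in one step.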
====3.1 PCIMs====

In this section, we review PCIMs [8]. PCIMs are a class of marked point process where the conditional intensity function for each label is a piecewise constant function of time, taking one of a finite number of values. That is, λ_l(t|y) is piecewise constant in t for all t > t(y), and takes on values {λ_{ls}} for s ∈ Σ_l, where Σ_l is a finite label-dependent state set. The value λ_{ls} taken on by λ_l(t|y) at t for each y is specified by a piecewise constant state function σ_l(t, y), so that λ_l(t|y) = λ_{l,σ_l(t,y)}. Note that the state s summarizes all the information about t and y necessary for computing λ_l(t|y), in that given s, λ_l(t|y) can be computed without further information about t and y. However, unlike in Markov models such as CTBNs [12], the state does not contain all the information about t and y needed for predicting future states.

As described above, the conditional intensity function λ_l(t|y) can be specified by a structure S_l = (Σ_l, σ_l(·, ·)) consisting of the state set Σ_l and the state function σ_l(·, ·), and a parameter vector θ_l composed of a non-negative intensity λ_{ls} for each s ∈ Σ_l. In turn, a PCIM is specified by a structure S and a parameter Θ that consist of the per-label structures S_l and per-label parameter vectors θ_l.
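As a toy illustration of a per-label structure, the sketch below (ours, with made-up intensity values) defines a state function σ_A over time-window tests of the history and the corresponding piecewise-constant intensity λ_A(t|y) = λ_{A,σ_A(t,y)}; it plugs directly into the intensity argument of the likelihood sketch above.

<syntaxhighlight lang="python">
def occurred(history, label, t, d1, d2):
    """True if an event with the given label occurred in the window [t - d2, t - d1)."""
    return any(l == label and t - d2 <= s < t - d1 for (s, l) in history)

def sigma_A(t, history):
    """Toy state function for label A over Sigma_A = {s1, s2, s3}."""
    if occurred(history, "A", t, 0.0, 1.0):      # an A in [t-1, t)?
        return "s1"
    if occurred(history, "A", t, 1.0, 2.0):      # an A in [t-2, t-1)?
        return "s2"
    return "s3"

theta_A = {"s1": 0.1, "s2": 0.01, "s3": 10.0}    # illustrative intensities lambda_{A,s}

def lambda_A(t, history):
    """Piecewise-constant conditional intensity for label A."""
    return theta_A[sigma_A(t, history)]
</syntaxhighlight>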
Gunawardana et al. [8] show that, given the structure S, a product of Gamma distributions is a conjugate prior for Θ, and that under this prior the marginal likelihood of the data can be given in closed form. Thus, parameter estimation can be done in closed form given a structure, and imposing a structural prior allows a closed form Bayesian score to be computed for a structure.

The structure of a PCIM can be represented by a set of decision trees [8]. In particular, each state function σ_l can be represented by a decision tree whose leaves represent the states s ∈ Σ_l, as shown in the example of Figure 1. The decision nodes in each tree contain functions that map a time t and a history y to one of its child nodes. These functions are piecewise constant in time, so that the state function σ_l(t, y) represented by a decision tree is also piecewise constant. Structure learning for each label l is performed by starting with the trivial tree, and iteratively refining it greedily based on the closed form Bayesian score mentioned above. A detailed presentation of this learning procedure, as generalized to C-PCIMs, is given in section 4.1.

[Figure 1: Example decision trees representing a PCIM for a problem with L = {A, B}. Each label's tree tests for occurrences of events in time windows such as [t−1, t), [t−2, t−1), and [t−5, t), and each leaf carries an intensity value.]

====3.2 Conjoint PCIMs====

Conjoint PCIMs (C-PCIMs), like PCIMs, are marked point processes where the conditional intensity functions λ_l(t|y) are piecewise constant and take on a finite number of values, but unlike in PCIMs, the conditional intensity functions in C-PCIMs take on values from a single set of values shared across all labels l ∈ L. Thus λ_l(t|y) takes on values {λ_s} for s ∈ Σ, which are shared across all l. Which of these values is taken on by λ_l(t|y) is specified by a C-PCIM state function σ(l, t, y) which, unlike PCIM state functions, is also a function of the label l whose conditional intensity function is being evaluated. Thus, λ_l(t|y) = λ_{σ(l,t,y)}. C-PCIMs therefore allow an intensity value λ_s to be shared across conditional intensity functions for different labels, possibly at different times and for different histories. Thus, a C-PCIM is defined by a structure S = (Σ, σ(·, ·, ·)) consisting of a state set Σ and a state function σ(·, ·, ·), as well as a parameter vector Θ = {λ_s}_{s∈Σ}, all of which are shared across labels l ∈ L.

We use a decision tree representation of the structure S of a C-PCIM. However, instead of using a different decision tree for each label l as in Gunawardana et al. [8], we use a single tree that is used across all labels, as shown in the example of Figure 2. The leaves of the tree represent states s ∈ Σ. However, the decision nodes of a C-PCIM tree contain functions that can depend on l as well as on t and y. In particular, each decision node contains a basis state function f chosen from a given set B. Each basis state function f(l, t, y) is piecewise constant in t for each l and y, and takes values from a finite basis state set Σ_f. Thus, a decision node with basis state function f has a child node corresponding to each s′ ∈ Σ_f. Since the basis state functions are defined to be valid piecewise constant state functions, the mapping from (l, t, y) to leaves given by the tree is also a valid piecewise constant state function σ(l, t, y).

[Figure 2: Decision tree representing a C-PCIM equivalent to the example PCIM of Figure 1. Decision nodes may test the label (e.g. l = A), the occurrence of events matching l in a time window (e.g. l in [t−1, t)), or the occurrence of specific labels in a time window (e.g. A in [t−5, t)).]
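The only change a C-PCIM makes to this picture is that the single shared tree may branch on the label. A minimal sketch of a shared state function and shared rate table follows (illustrative names and values; the tests loosely follow the kinds of decision nodes used in Figure 2).

<syntaxhighlight lang="python">
def matched(history, label, t, d1, d2):
    """Match test: did an event with this same label occur in [t - d2, t - d1)?"""
    return any(l == label and t - d2 <= s < t - d1 for (s, l) in history)

def sigma(label, t, history):
    """Shared C-PCIM state function sigma(l, t, y): decision nodes may test the
    label as well as the time and history, so every label maps into one state set."""
    if matched(history, label, t, 0.0, 1.0):          # did l itself occur in [t-1, t)?
        if matched(history, label, t, 1.0, 2.0):      # ... and also in [t-2, t-1)?
            return "s_repeat"
        return "s_recent"
    if label == "A":                                   # label test, e.g. l = A
        return "s_quiet_A"
    return "s_quiet_other"

shared_rates = {"s_repeat": 0.01, "s_recent": 10.0,   # one rate per shared state
                "s_quiet_A": 0.1, "s_quiet_other": 0.05}

def intensity(label, t, history):
    """lambda_l(t | y) = lambda_{sigma(l, t, y)}, with rates shared across labels."""
    return shared_rates[sigma(label, t, history)]
</syntaxhighlight>

Because every label routes into the same leaves, data from frequent and rare labels that land in the same leaf are pooled when that leaf's single rate is estimated.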
===4 Learning Conjoint PCIMs===

In this section, we directly generalize the parameter and structure learning approaches for PCIMs [8] to apply to C-PCIMs. For C-PCIMs, the likelihood of equation (1) can be written as

: <math>p(x \mid S, \Theta) = \prod_{s \in \Sigma} \lambda_s^{c_s(x)}\, e^{-\lambda_s d_s(x)} \qquad (2)</math>

where d_s(x) and c_s(x) are sufficient statistics of the data. d_s(x) is the total duration, summed over labels l, during which σ(l, t, h(t, x)) = s, and c_s(x) is the number of times a label l occurs in x when the state function maps to state s for label l. Formally,

: <math>d_s(x) = \sum_{l \in L} \int_0^{t(x)} 1_s\big(\sigma(l, \tau, h(\tau, x))\big)\, d\tau</math>
: <math>c_s(x) = \sum_{l \in L} \sum_i 1_l(l_i)\, 1_s\big(\sigma(l, t_i, h_i)\big).</math>

Note that since σ(l, τ, h) in the integral above is piecewise constant in τ, the integral reduces to a sum over the constant pieces of σ(l, τ, h).

A product of Gamma priors on λ_s is conjugate for Θ. The corresponding prior and posterior densities for λ_s are given by

: <math>p(\lambda_s \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda_s^{\alpha-1} e^{-\beta \lambda_s}</math>
: <math>p(\lambda_s \mid \alpha, \beta, x) = p\big(\lambda_s \mid \alpha + c_s(x),\ \beta + d_s(x)\big).</math>

In our experiments, we obtain the point estimate Θ̂ = E[Θ | S, x] from the training data, given by

: <math>\hat{\lambda}_s = \frac{\alpha + c_s(x)}{\beta + d_s(x)}.</math>

====4.1 Structure Learning====

For structure learning, we can write the marginal likelihood of the data x given the structure S in closed form as

: <math>p(x \mid S) = \prod_{s \in \Sigma} \underbrace{\frac{\beta^\alpha}{\Gamma(\alpha)} \frac{\Gamma(\alpha + c_s(x))}{(\beta + d_s(x))^{\alpha + c_s(x)}}}_{\gamma_s(x)}.</math>

As with PCIMs, we use a Bayesian decision tree building procedure [3] in order to learn the structure S. We begin with the trivial structure Σ = {s_0}, σ(l, t, y) = s_0 with only the root node s_0, and refine the structure S by iteratively splitting leaves s ∈ Σ based on basis state functions f ∈ B. In particular, given a current structure S = (Σ, σ), a new structure S′ = (Σ′, σ′) is produced by selecting a state s ∈ Σ and a basis state function f ∈ B and refining s based on f as follows:

: <math>\Sigma' = \bigcup_{s' \in \Sigma_f} \{s \circ s'\} \;\cup\; \Sigma \setminus \{s\}</math>
: <math>\sigma'(l, t, y) = \begin{cases} s \circ f(l, t, y) & \text{if } \sigma(l, t, y) = s \\ \sigma(l, t, y) & \text{otherwise} \end{cases}</math>

where ∘ is the concatenation operator. Thus, S′ can be represented as a tree where leaf s of the tree S has been split according to the result of f.

In order to select the state s and basis state function f to use in producing a refined structure S′ from S, we define a factored prior

: <math>p(S) \propto \kappa^{|\Sigma|}</math>

on the structure S. The posterior probability of a structure S given the data x is then proportional to <math>p(S)\, p(x \mid S) = \prod_{s \in \Sigma} \kappa\, \gamma_s(x)</math>, which can be computed in closed form. Thus, the gain in p(S|x) due to splitting state s using basis state function f is

: <math>\mathrm{Gain}(S \to S') = \frac{p(S' \mid x)}{p(S \mid x)} = \frac{\prod_{s' \in \Sigma'} \kappa\, \gamma_{s'}(x)}{\prod_{s \in \Sigma} \kappa\, \gamma_s(x)} = \frac{\prod_{s' \in \Sigma_f} \kappa\, \gamma_{s \circ s'}(x)}{\kappa\, \gamma_s(x)}.</math>

We refine the structure greedily, choosing the refinement with the highest gain, until no further gain results.
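The learning procedure above reduces to bookkeeping over the leaves. The sketch below (ours; it reuses the simplification, as in the earlier likelihood sketch, that the state function only changes value at event times) accumulates the sufficient statistics c_s(x) and d_s(x), forms the posterior-mean rates λ̂_s = (α + c_s(x))/(β + d_s(x)), and evaluates the per-state factor log γ_s(x) from which the greedy splitting score is built.

<syntaxhighlight lang="python">
import math
from collections import defaultdict

def sufficient_stats(events, labels, sigma):
    """c[s]: events whose own label maps to state s at their arrival time.
       d[s]: total duration (summed over labels) spent in state s.
       Assumes sigma(l, t, h) is constant between successive events (sketch only)."""
    c = defaultdict(float)
    d = defaultdict(float)
    prev_t, history = 0.0, []
    for (t_i, l_i) in events:
        dt = t_i - prev_t
        for l in labels:
            d[sigma(l, prev_t, history)] += dt
        c[sigma(l_i, t_i, history)] += 1.0
        history.append((t_i, l_i))
        prev_t = t_i
    return c, d

def posterior_mean_rates(c, d, alpha, beta):
    """Closed-form posterior means under the conjugate Gamma(alpha, beta) prior."""
    return {s: (alpha + c[s]) / (beta + d[s]) for s in set(c) | set(d)}

def log_gamma_factor(c_s, d_s, alpha, beta):
    """log gamma_s(x), the per-state factor of the closed-form marginal likelihood."""
    return (alpha * math.log(beta) + math.lgamma(alpha + c_s)
            - math.lgamma(alpha) - (alpha + c_s) * math.log(beta + d_s))
</syntaxhighlight>

A candidate split of leaf s on basis state function f is then scored by recomputing the statistics for its children and comparing Σ_{s′∈Σ_f} [log κ + log γ_{s∘s′}(x)] against log κ + log γ_s(x); the leaf and basis function with the largest positive log-gain are chosen at each step.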
===5 Basis State Functions for C-PCIMs===

The modeling power of a family of C-PCIMs is determined by the basis B of state functions selected. The basis needs to capture the aspects of the history that determine the intensities of events, and needs to distinguish labels with different intensities. In addition, the basis needs to allow the sharing of parameters between labels to allow for generalization of event behavior between labels. This is particularly important in problems where some labels occur rarely. We will give basis state functions that take advantage of known structure in the label sets in order to do this. We first describe how we capture label space structure through label attributes, and then give an ontology of basis state functions that make use of this structure.

====5.1 Structured Labels====

When the labels have a known structure, we will take advantage of it in order to define models that can learn dependencies between events with labels that may be rare, or may not even occur in the training data. For example, in data center system event logs, events may be labeled with the machine on which the event took place and the type of the event. An event of type disk-sector-error may occur on machine 9,045. While we may have never observed a disk sector error on a different machine 6,732, we may wish to allow the structure learning procedure to determine whether the behavior of disk-sector-error events generalizes across machines. Thus, we would like the basis state functions to be able to query for the message represented by a label independently of the machine.

In general, we assume that the label set L has a set of attributes A, where each attribute a ∈ A can take values in a set V_a. Label l takes on value v_a(l) of attribute a. In the example above, A = {machine-id, event-type}, V_machine-id ranges over all the machines in the data center, and V_event-type ranges over all possible events. If prior information about the labels is available, it may be encoded through label attributes. For example, if we know a priori that machines in the data center are grouped into database servers and web servers, we could introduce an attribute server-type that takes on values V_server-type = {database, web}. On the other hand, we can access label identity as an attribute by using the attribute identity with V_identity = L and v_identity(l) = l. In the descriptions below, we will therefore always assume that there are label attributes defined. In cases where no structural information is available we will simply use A = {identity}. In particular, the basis state functions for PCIMs [8] do not explicitly use label attributes, but can be described as using this trivial identity attribute.
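A hypothetical attribute lookup for the data center example might look as follows (the attribute names mirror the running example; the helper name is ours).

<syntaxhighlight lang="python">
def attribute_value(label, attribute):
    """v_a(l): the value of attribute a for a (machine, message) label."""
    machine, message = label
    return {"machine-id": machine,
            "event-type": message,
            "identity": label}[attribute]   # identity: V_identity = L, v_identity(l) = l

# e.g. attribute_value((9045, "disk-sector-error"), "event-type") -> "disk-sector-error"
</syntaxhighlight>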
====5.2 Types of Basis State Functions====

Allowing large classes of basis state functions that depend arbitrarily on both the history and time as well as the label is difficult computationally. We therefore restrict attention to three specific classes of basis state functions in this paper.

History Basis State Functions: A history basis state function f(l, t, y) depends only on the history y and the time t, and not on the label l. In this paper, we concentrate on a particular class of history basis state functions f_{a,v,d1,d2}(l, t, y) indexed by an attribute a ∈ A, a value v ∈ V_a, and time offsets d2 > d1 ≥ 0, given by

: <math>f_{a,v,d_1,d_2}(l, t, y) = \begin{cases} 1 & \text{if } \exists (t', l') \in y :\ t' \in [t - d_2,\ t - d_1) \ \wedge\ v_a(l') = v \\ 0 & \text{otherwise.} \end{cases}</math>

That is, f_{a,v,d1,d2}(l, t, y) tests whether the history y contains an event in the time window between d1 and d2 before t whose label has the value v of attribute a.

Example. In modeling web search query logs, the history basis state function f_{query-category,Health,1 hr,1 day}(l, t, y) tests whether y contains a query whose query-category attribute is Health between 1 hour and 1 day before t.

Label Basis State Functions: A label basis state function f(l, t, y) depends only on the label l and not on the time t nor the history y. A label basis state function f_{a,v}(l, t, y) is indexed by an attribute a ∈ A and a value v ∈ V_a, and is given by

: <math>f_{a,v}(l, t, y) = \begin{cases} 1 & \text{if } v_a(l) = v \\ 0 & \text{otherwise.} \end{cases}</math>

That is, f_{a,v}(l, t, y) simply tests whether the attribute a of label l has value v.

Example. The label basis state function f_{query-category,Health}(l, t, y) tests whether l has query-category attribute Health.

Match Basis State Functions: A match basis state function f_{a,d1,d2}(l, t, y) is given by

: <math>f_{a,d_1,d_2}(l, t, y) = \begin{cases} 1 & \text{if } \exists (t', l') \in y :\ t' \in [t - d_2,\ t - d_1) \ \wedge\ v_a(l') = v_a(l) \\ 0 & \text{otherwise.} \end{cases}</math>

In other words, the match basis state function tests whether the history y contains an event in the time window between d1 and d2 before t whose label matches l in attribute a. This kind of basis state function is useful in modeling repetitions of a given attribute in an event stream.

Example. The basis state function f_{query-category,1 hr,1 day}(l, t, y) tests whether y contains a query with the same query-category attribute as l between 1 hour and 1 day before t.

We note that history basis state functions are equivalent to the history basis functions of PCIMs [8] when the attribute a is restricted to be the identity attribute described above. Thus, a PCIM can be represented as a C-PCIM that uses only the identity attribute, and whose state function uses a tree that first splits based on every possible label basis state function, and then uses only history basis state functions. We use the term Attribute PCIM (A-PCIM) to refer to the slight generalization of PCIMs that allows the history basis state functions to use arbitrary attributes.
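All three families are simple predicates over (l, t, y) once attribute lookup is available. The following sketch (ours) reuses attribute_value from the previous example; each constructor returns a basis state function with Σ_f = {0, 1}.

<syntaxhighlight lang="python">
def history_basis(a, v, d1, d2):
    """f_{a,v,d1,d2}: is there an event in [t-d2, t-d1) whose label has value v for attribute a?"""
    def f(label, t, history):
        return int(any(t - d2 <= s < t - d1 and attribute_value(l, a) == v
                       for (s, l) in history))
    return f

def label_basis(a, v):
    """f_{a,v}: does label l itself have value v for attribute a?"""
    def f(label, t, history):
        return int(attribute_value(label, a) == v)
    return f

def match_basis(a, d1, d2):
    """f_{a,d1,d2}: is there an event in [t-d2, t-d1) whose label agrees with l on attribute a?"""
    def f(label, t, history):
        return int(any(t - d2 <= s < t - d1 and
                       attribute_value(l, a) == attribute_value(label, a)
                       for (s, l) in history))
    return f

# e.g. a test for a repeat of the same message type within the last hour (hypothetical units):
# f = match_basis("event-type", 0.0, 1.0)
</syntaxhighlight>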
===6 Experimental Results===

In this section we evaluate the value of C-PCIMs on two real world tasks with large structured label spaces. The first is to model the query behavior of web search users, and the second is to model the behavior of a cluster of machines in a commercial data center.

In order to evaluate the value of C-PCIMs in modeling event streams with large label spaces, we compare C-PCIMs to PCIMs. To explore the gains due to conjoint modeling as opposed to the use of richer label attributes in modeling, we also compare to A-PCIMs. It has been shown that PCIMs have better computational and predictive performance in comparison to Poisson networks [15], which model conditional intensity functions using generalized linear models. We therefore do not compare to Poisson networks and other closely related approaches that use regression techniques to model the conditional intensity functions [23, 24]. CTBNs with "toggle" variables modeling events and latent variables modeling dynamics are a natural baseline for continuous time event modeling. We explored building such models using the CTBN-RLE toolkit [20], but the current version does not scale to these data sets [19].

====6.1 Web Search Query Behavior====

Queries issued by the users of a major commercial search engine were used to generate a training set consisting of approximately 100,000 queries collected from approximately 6,000 users over a two month period, and a test set consisting of approximately 170,000 queries from approximately 6,000 users over the next month. There was no user or time overlap between the training and test sets. The test set was restricted to contain only users whose query history was available for at least two weeks. The queries were automatically mapped to a two level hierarchy of categories such as Health & Wellness/Mental Health. Thus, our label set consisted of the 476 categories in the hierarchy, while the 37 coarse level categories were used as a second (i.e. non-identity) attribute, which we name coarse. All labels occurred in the training set.

We use a product of terms of the form of equation (1), one per user, to model the data, choosing not to model inter-user dependencies. We investigated three model classes for modeling users' query behavior. First, we built PCIMs which used only history basis state functions using the identity attribute (i.e. the history basis functions only tested for the occurrence of specific labels in the history). Second, we built A-PCIMs that were also allowed to use the coarse attribute in the history basis functions. Third, we built C-PCIMs that also used label basis state functions using the identity and coarse attributes, and match state functions using the identity attribute. Match state functions using the coarse attribute were not allowed, in order to reduce the computation time. All models used a prior with α = 1/365, β = 1 day, and κ = 0.001. The history and match basis state functions used the time windows [t − 1 hr, t), [t − 1 day, t − 1 hr), [t − 7 days, t − 1 day), and (−∞, t − 7 days). All models took less than 12 hours to train on a single 3 GHz workstation.

[Figure 3: Test set log likelihood of PCIM, A-PCIM, and C-PCIM for the search query data, as a function of training set size (log likelihood ×10^5; training set size 1% to 100%).]

Figure 3 shows the test set log likelihood of all three models as the amount of training data is varied. The log likelihoods of the PCIM and A-PCIM are nearly indistinguishable, while the C-PCIM performs much better, especially with smaller amounts of training data. This is the expected behavior, since C-PCIMs are better able to share parameters. Note that since C-PCIMs, A-PCIMs, and PCIMs define the same family of marked point processes, we expect them to approach the same predictive performance as the training set grows. While the gap between C-PCIMs and A-PCIMs/PCIMs does shrink as the amount of training data grows, C-PCIMs have higher training set likelihood even when all the training data is used. We examined the likelihoods assigned to each test event by the C-PCIM and the A-PCIM trained with the full training set, in order to determine the statistical significance of this gap. We grouped the per-event likelihoods by label, and found that the C-PCIM significantly outperforms the A-PCIM (p = 0.01) on 391 out of the 436 labels observed in the test set, while under-performing the A-PCIM on none, according to a paired sign test.

To understand the practical impact of the likelihood gains, we used importance sampling [8] to forecast whether each test user would issue Health & Wellness/Mental Health queries on the eighth day in the test set given their behavior in the first week. Precision-recall curves for the three models are given in Figure 4.

[Figure 4: Precision-Recall curves of PCIM, A-PCIM, and C-PCIM for predicting Health & Wellness/Mental Health queries.]
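The forecasts above use importance sampling as in [8]; a simpler way to convey the idea is plain forward sampling from the learned intensities, sketched below under the same constant-between-events simplification used earlier (all names are ours, and this is not the estimator used in the experiments).

<syntaxhighlight lang="python">
import random

def sample_forward(history, labels, intensity, t_start, t_end, rng=random):
    """Draw one continuation of the event stream on (t_start, t_end] by competing
    exponential clocks; exact only if each lambda_l stays constant until the next event."""
    t, hist = t_start, list(history)
    while True:
        rates = {l: intensity(l, t, hist) for l in labels}
        total = sum(rates.values())
        if total <= 0.0:
            return hist
        t += rng.expovariate(total)
        if t > t_end:
            return hist
        u, acc = rng.random() * total, 0.0
        for l, r in rates.items():
            acc += r
            if u <= acc:
                hist.append((t, l))
                break

def prob_label_in_window(history, labels, intensity, t_start, t_end, target, n=1000):
    """Monte Carlo estimate of P(an event with label `target` occurs in (t_start, t_end])."""
    hits = 0
    for _ in range(n):
        draw = sample_forward(history, labels, intensity, t_start, t_end)
        if any(l == target and t > t_start for (t, l) in draw):
            hits += 1
    return hits / n
</syntaxhighlight>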
Although there are only eight test users who issued these queries on that day, C-PCIMs make much better predictions than A-PCIMs and PCIMs. Such predictions are useful in applications such as targeted display (banner) advertising, where an advertiser may only wish their advertisements to be shown to web users who are interested in a topic related to their advertisement. The assumption is that users who will issue a query in a particular category during the day may be more receptive to advertisements related to that category during that day.

====6.2 Data Center Machine Behavior====

System logs from a cluster of machines in a commercial data center were used to generate a data set of about 300,000 logged system events from 71 machines, over the period of a month. There were 221 possible messages. This gave 15,691 possible labels, each of which was a machine-message combination. Each machine belonged to one of four machine types that was known a priori. Thus, each label had four attributes {machine, message, machine-type, identity}. The first two weeks of data were used for training, and the rest for testing.

We use a product of terms of the form of equation (1), one per machine, to model the data, choosing not to allow inter-machine dependencies. We experimented with using the power of C-PCIMs to allow machine specific dependencies. In particular, we compare a PCIM and a C-PCIM that treat machines identically with a PCIM and a C-PCIM that allow machine specific dependencies. Non-identical modeling of machines may be useful if, for example, a certain machine is old and more prone to failures. Models that treat machines identically are allowed to use only the message attributes of labels. The PCIM that treats machines identically is forced to use the same conditional intensity function for all labels that agree on their message attributes, pooling data from these labels during training. Note that in this setting, PCIMs and A-PCIMs are equivalent in both the identical machine and non-identical machine cases.

All models used the prior α = 0.01, β = 1.0 day, and κ = 0.01. The C-PCIM trees were truncated at a depth of 30 to save computation. The history and match basis state functions used the time windows [t − 20 min, t) and [t − 1 hr, t − 20 min). The models took less than 2 hours to train, except the non-identical machine PCIM, which learned a separate tree for each of the 5,307 labels present in the training data. This took less than 12 hours.

[Figure 5: Test set log likelihood of PCIMs and C-PCIMs that treat machines as identical, and PCIMs and C-PCIMs that do not, for the data center event log data (log likelihood ×10^6).]

Figure 5 shows the log likelihood of events after the first two weeks given the events of the first two weeks for all four models. Note that the log likelihoods can be positive since the likelihoods are density values and not probabilities. It can be seen that building separate PCIMs for each label, without pooling data across machines for each message, results in a large loss in test set likelihood, due to data sparsity. Both C-PCIMs outperform the PCIMs. Note that the identical machine C-PCIM only has access to message information, and therefore does not have access to any label structure.
However, it gives a large gain over PCIMs due to the use of match basis state functions, which model repeated events. The ability of the non-identical machine C-PCIM to leverage label structure to learn dependencies specific to machines or machine types leads to a further gain in likelihood. Unfortunately, events of interest such as machine failures are very rare in this data set, so that there are not enough test cases to obtain statistically significant comparisons between C-PCIMs and PCIMs.

===7 Conclusions===

We have introduced Conjoint PCIMs, and shown how they improve upon PCIMs in modeling the dynamics of two real world event streams with large structured label sets. We have shown how conjoint modeling across labels allows better models to be built from sparse data, and how C-PCIMs can leverage known structure in the label space. We have also shown that the predictive gains of C-PCIMs are not achieved by Attribute PCIMs that leverage label structure but do not use conjoint modeling.

While it would be of interest to compare the performance of C-PCIMs with other approaches such as CTBNs, limitations in the currently available CTBN implementation prevented us from doing so. It would also be interesting to investigate approaches that extend CTBNs to allow them to take advantage of label structure in the manner of C-PCIMs. Another future direction of interest is to investigate other extensions of PCIMs that allow parameter sharing across labels, such as hierarchical Bayesian approaches or regularization approaches.

===References===

[1] Leonard E. Baum and Ted Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat., 37(6):1554–1563, December 1966.

[2] Emery N. Brown, Robert E. Kass, and Partha P. Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuro., 7(5), 2004.

[3] David Maxwell Chickering, David Heckerman, and Christopher Meek. A Bayesian approach to learning Bayesian networks with local structure. In UAI, 1997.

[4] D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes: Elementary Theory and Methods, volume I. Springer, 2nd edition, 2003.

[5] Thomas Dean and Keiji Kanazawa. Probabilistic temporal reasoning. In AAAI, 1988.

[6] Vanessa Didelez. Graphical models for marked point processes based on local independence. J. Roy. Stat. Soc., Ser. B, 70(1):245–264, 2008.

[7] N. Friedman, I. Nachman, and D. Pe'er. Using Bayesian networks to analyze expression data. J. Comp. Bio., 7:601–620, 2000.

[8] Asela Gunawardana, Christopher Meek, and Puyang Xu. A model for temporal dependencies in event streams. In NIPS, 2011.

[9] Mei-Yuh Hwang and Xuedong Huang. Shared-distribution hidden Markov models for speech recognition. IEEE Trans. Spch. & Aud. Proc., 1(4):414–420, October 1993.

[10] Mei-Yuh Hwang, Xuedong Huang, and Fileno A. Alleva. Predicting unseen triphones with senones. IEEE Trans. Spch. & Aud. Proc., 4(6):412–419, November 1996.

[11] Slava M. Katz. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust., Spch., & Sig. Proc., ASSP-35(3):400–401, March 1987.

[12] Uri Nodelman, Christian R. Shelton, and Daphne Koller. Learning continuous time Bayesian networks. In UAI, 2003.

[13] Adam Oliner and Jon Stearley. What supercomputers say: an analysis of five system logs. In IEEE/IFIP Conf. Dep. Sys. Net., 2007.

[14] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–286, February 1989.

[15] Shyamsundar Rajaram, Thore Graepel, and Ralf Herbrich. Poisson-networks: A model for structured point processes. In AIStats, 2005.

[16] Vinayak Rao and Yee Whye Teh. Fast MCMC sampling for Markov jump processes and continuous time Bayesian networks. In UAI, 2011.

[17] Vinayak Rao and Yee Whye Teh. Gaussian process modulated renewal processes. In NIPS, 2011.

[18] Ardavan Saeedi and Alexandre Bouchard-Côté. Priors over recurrent continuous time processes. In NIPS, 2011.

[19] Christian R. Shelton. Personal communication, 2012.
[20] Christian R. Shelton, Yu Fan, William Lam, Joon Lee, and Jing Xu. Continuous time Bayesian network reasoning and learning engine. JMLR, 11, 2010.

[21] Aleksandr Simma, Moises Goldszmidt, John MacCormick, Paul Barham, Richard Black, Rebecca Isaacs, and Richard Mortier. CT-NOR: Representing and reasoning about events in continuous time. In UAI, 2008.

[22] Aleksandr Simma and Michael I. Jordan. Modeling events with cascades of Poisson processes. In UAI, 2010.

[23] Wilson Truccolo, Uri T. Eden, Matthew R. Fellows, John P. Donoghue, and Emery N. Brown. A point process framework relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. J. Neurophysiol., 93:1074–1089, 2005.

[24] Duy Q. Vu, Arthur U. Asuncion, David R. Hunter, and Padhraic Smyth. Continuous-time regression models for longitudinal networks. In NIPS, 2011.

[25] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc., Ser. B, 68(1):49–67, 2006.