Discovering Customer Journey Maps using a
              Mixture of Markov Models

 Matthieu Harbich1,2 , Gaël Bernard1 , Pietro Berkes2 , Benoît Garbinato1 , and
                             Periklis Andritsos3
 1
     University of Lausanne, Faculty of Business and Economics (HEC), Switzerland,
                           [firstname].[lastname]@unil.ch
             2
               NAGRA Kudelski, Switzerland, pietro.berkes@nagra.com
                3
                  University of Toronto, Faculty of Information, Canada,
                           periklis.andritsos@utoronto.ca


        Abstract. Customer Journey Maps (CJMs) summarize the behavior of
        customers by displaying the most common sequences of steps they take
        when engaging with a company or product. In many practical applications,
        the challenge lies in automatically discovering these prototypical sequences
        from raw event logs for thousands of customers. We propose a novel,
        probabilistic approach based on a mixture of Markov models and show it
        can reliably extract CJMs with just one input parameter (and potentially
        none).

        Keywords: markov models, customer journey mapping, event logs, cus-
        tomer journey analytics


1     Introduction

Managing customer experience is becoming increasingly complex. The plethora
of ways in which a customer can consume a service, the acceleration of media and
channel fragmentation, and the increase in customer-to-customer interactions
through social media create opportunities, but also present major challenges
to companies [7]. Customer Journey Maps (CJMs) respond by proposing an
intelligible visualization that makes it possible to understand, discuss, or improve
the main paths (i.e., journeys) in the usage of a service.
    In practice, one is usually faced with a log of raw events generated from
thousands of users interacting with a system. Each of these events symbolizes a
step a customer takes when using a service, while a single customer’s journey
might be composed by several steps (i.e., events). This yields the following
challenge: how to build a CJM that summarizes well the journeys observed in
the event logs? Figure 1 illustrates this challenge by using an event log composed
of five journeys Ê. As shown in Fig. 1, displaying all individual journeys on a
single CJM leads to an unreadable model Ë. One option, proposed by Bernard
and Andritsos [2], is to summarize individual journeys with a few representative
journeys, and assign each journey to its closest representative Ì. To this end, they


                                            3
      1 Event logs     2 CJM with actual journeys          3 CJM with representative journeys
     (e.g., XES)      A 4                                  A   1, 3, 4
      1 B D A         B 13                                 B
      2 E C G A       C                                    C
      3 B D A F       D                                    D
      4 B D F         E 5                                  E
                         2                                      5, 2
      5 E C G C A     F                                    F
                      G                                    G
                                                    time                                   time


Fig. 1. An event log containing five journeys, where each letter represents an activity
type Ê, a CJM with five actual journeys Ë, and its equivalent using two representative
journeys Ì.


take advantage of hierarchical clustering to organize the journeys and, then, they
apply a frequent sequence mining algorithm to define representative journeys.
    We contribute by proposing the first probabilistic approach that converts
event logs in a CJM. We first describe our approach and, then, we illustrate
its most appealing features – missing in [2] – using the Chicago activity survey
dataset.


2    Approach

The starting point for our approach is to assume that the customers belong to a
small number of groups, each having a similar behavior in the sense of having
the same probability of transitioning from one event to the next.
   We describe our approach in more details below: 1) the input, 2) the Markov
model, and 3) the use of an Expectation-Maximization (EM) algorithm on a
mixture of Markov models.


Input. Event logs are the input data for our algorithm. For example, logs can be
provided in the XES format, a standard representation used in Process Mining
to model processes [1], which can be mapped to store customer journeys [3].
The only other input is the number of representative journeys one would like to
discover, k .


Markov Model as an internal representation. Our model for event se-
quences is a Markov model: according to the model, each sequence is generated
by choosing the first event in the sequence, X1 , with probability P (X1 = i) = fi .
The probability of every successive event in the sequence depends only on the
event preceding it, i.e. P (Xt = i|Xt−1 , Xt−2 , ..., X1 ) = P (Xt = i|Xt−1 = j) = Tij .
By choosing the parameters f and T to match the probability of the first event
and the probability of event transitions for a group of observed sequences, the
model captures our assumption that journeys belonging to the same group share
the same probability of transitioning between events.


                                                4
Expectation-Maximization on a mixture of Markov models. Computing
one Markov model for the entire dataset would only describe one generic type
of behavior. Our assumption is that we have a set of k groups of customers,
each with its own model, i.e., we have a mixture of k Markov models. Given
a set of journeys, and the set of k pairs of parameters (Tk , fk ) for each group,
the probability of the data given the parameters (known as the likelihood of the
parameters) can be written as
                                     N X
                                     Y K
               LH(Tk , fk , πk ) =             P (X n |Ck ; Tk , fk )P (Ck ; πk )   (1)
                                     n=1 k=1

where
 N           : The set of sequences
 K           : The number of Markov models
 P(X n |Ck ) : The probability of sequence X n under model Ck
 P(Ck )      : The probability of model k (one group of behavior might be more
               likely than another

    The parameters Tk , fk , and πk are learned by Maximum Likelihood, i.e., by
maximizing the probability of the data under the mixture, LH(Tk , fk , πk ). An effi-
cient way of achieving that is using the Expectation-Maximization algorithm [5,6].
The expectation part evaluates the models’ responsibilities for the behavior using
the current parameters. In the maximization part, the algorithm re-computes
the parameters using the responsibilities of the different models computed in the
expectation part. The algorithm loops until the likelihood converges [4].
    The K discovered models, each composed by a transition matrix and a first
probability vector, represent the K different types of behaviors the dataset
contains. Our approach offers interesting features that will be illustrated in the
next section.


3     Case study

The probabilistically-grounded nature of the algorithm offers many advantages
compared to the approach described in [2], which is based on hierarchical cluster-
ing of edit distances of the underlying journeys. We illustrate the advantages of
our approach by applying our algorithm to a real, publicly available dataset. We
first describe the dataset and then list four interesting features of this probabilistic
approach.
    The dataset consists of an activity survey for the city of Chicago 1 . It comprises
29,542 journeys (2,381 are unique). Typically, one journey might be: “being at
home” → “going to school” → “having a meal” → “being at home”. Such a journey
has a length of four events. The average length for the dataset is 4.817 events
and there are 16 types of event. Fig. 2 shows the resulting CJM with k=3.
1
    http://www.cmap.illinois.gov/data/transportation/travel-survey


                                               5
         Activities:           1   Most likely representative          2 Second most likely
                                   journeys                              representative journeys
                  Work

            Entertainment

          Routine / shopping

            Home activities
                                                                time                                      time
                                       LEGEND
                                                Cluster 1              Cluster 2              Cluster 3


Fig. 2. Results when applied on the Chicago activity survey dataset with k=3. Ê shows
the most likely representative journeys seen in the event logs for each cluster and Ë
shows the second most likely representative journeys.


Feature 1. The first interesting feature resides in the ability to determine the
likelihood of a given journey under each model (see Fig. 2). Each journey is
assigned to a model k with a given probability, P (X|Ck ). The journey with the
highest probability becomes the representative. It should be noted that this also
means that we can return other alternatives (e.g., the second most likely journeys
as seen on Fig. 2), which can be useful for investigating how a journey might
deviate from the most likely representative journey.

Feature 2. The next interesting feature is the limited number of parameters.
The work proposed in [2] includes numerous parameters that must be chosen from
(e.g., the distance between journeys, and the method for finding the representative
or computing the distance between clusters), which can have a substantial impact
on the representative journeys that are returned. In contrast, our algorithm
requires us to provide only the number of groups, k. Even this parameter can be
removed by using standard model selection techniques (e.g., using a Bayesian
Information Criterion (BIC) penalty), but such a technique is outside the scope
of this short paper.

Feature 3. The next most probable event of an incomplete journey can be
computed on the fly by checking the most probable transition parameter under
its most likely model (see Fig. 3). Moreover, one can also use the transition matrix
to generate new journeys that have never been seen but that are still likely to
happen according to the mixture of models. Generating such data journeys can
be useful for training or analysis purposes.

Feature 4. One of the interesting features of our probabilistic-approach is the
way it handles shifting customers’ behavior. For example, an incomplete journey
is assigned to a Markov model with a given probability. The most probable
Markov Model might change when new events are observed. Hence, it allows us
to dynamically cope with shifting behaviors. A customer’s behavior is complex
[7] and, therefore, we argue that assigning the customer to a continuum between,


                                                     6
                  Activities:           Observed events       Prob. of
                                                             next event
                     Visiting friends                              lh: 9%

                     Entertainment
                                                                    lh: 16%

                           Work                                     lh: 17%

                     Home activities                                lh: 2%

                                                          (others) lh: 56% time


    Fig. 3. Inferring the most likely next event based on a set of observed sequences.


potentially several different models is more appropriate than using a discrete
classification.
    Our algorithm scales well as it has been applied – as part of the collaboration
with Kudelski – on a real dataset consisting of more than 7 million events for
more than half a million customers. It results in the discovery of several customer
behavioral patterns that enable marketers to design more personalized campaigns.
    Overall, a probabilistic approach offers a robust alternative based on a small
number of assumptions and with a small number of parameters. The four features
we presented in this paper might serve as a starting point for studying alternative
probabilistic approaches to discovering sets of journeys to be displayed on CJMs
using standardized event logs.

4     Acknowledgement
This research was conducted at NAGRA Kudelski (Lausanne, Switzerland). We
would like to thank the Kudelski Group for the fruitful collaboration and support.

References
1. van der Aalst, W.: Process Mining: Data Science in Action. Springer (2016)
2. Bernard, G., Andritsos, P.: Cjm-ex: Goal-oriented exploration of customer journey
   maps using event logs and data analytics. In: 15th International Conference on
   Business Process Management (BPM2017) (2017)
3. Bernard, G., Andritsos, P.: A process mining based model for customer journey
   mapping. In: Proceedings of the Forum and Doctoral Consortium Papers Presented
   at the 29th International Conference on Advanced Information Systems Engineering
   (CAiSE 2017) (2017)
4. Bishop, C.M.: Pattern recognition and machine learning. Information science and
   statistics, Springer, New York (2006)
5. Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete
   data via the EM algorithm. Journal of the Royal Statistical Society. Series B
   (methodological) pp. 1–38 (1977)
6. Kiran, P.M., Rao, A.P., Ratnamala, B.: An efficient approach for filling incomplete
   data http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.3758&
   rep=rep1&type=pdf
7. Lemon, K.N., Verhoef, P.C.: Understanding customer experience throughout the
   customer journey. Journal of Marketing (2016)


                                             7