    Characterising the Difference and the Norm
           between Sequence Databases

                           Frauke Hinrichs and Jilles Vreeken

       Max-Planck Institute for Informatics and Saarland University, Germany
                       {hinrichs,jilles}@mpi-inf.mpg.de



        Abstract. In pattern set mining we are after a small set of patterns
        that together are characteristic for the data at hand. In this paper we
        consider the problem of characterizing not one, but a set of sequence
        databases, such as a collection of articles or the chapters of a book. Our
        main objective is to find a set of patterns that captures the individual
        features of each database, while also finding shared characteristics of any
        subset of the data we are interested in.
        We formulate this problem in terms of MDL, and propose SqsNorm, an
        efficient algorithm to extract high quality models directly from data. We
        devise a heuristic to quickly find and evaluate candidates, as well as a
        model that takes shared and singular patterns into account.
        Experiments on synthetic data confirm that SqsNorm ably reconstructs
        the ground truth model. Experiments on text data, including a set of
        speeches, several book chapters, and a collection of songs, show that
        SqsNorm discovers informative, non-redundant and easily interpretable
        pattern sets that give clear insight into the data.


1    Introduction
In this paper we propose SqsNorm, a method to characterise the difference and
norm between multiple sequence databases in terms of sequential patterns. To
make this less abstract, we will consider the case of analysing fairy tales. Assume
we pick a set of fairy tales and analyze it with the goal of finding expressions
that are typical for one or more of these stories. For example, most of them will
have a happy ending. Some might even employ the catchy “and they lived happily
ever after”. We are likely to find a healthy share of kings and queens, beautiful
daughters and evil stepmothers. We will also notice expressions (or patterns, if
you will) that have their grand entrance only once. Imagine candidates like Snow
White or the Seven Little Kids, who are both very prominent in their respective
tales but have no other appearances elsewhere.
    The goal of SqsNorm is to discover exactly such a result. We want to find
expressions that characterize a chosen subset of fairy tales, as well as patterns
that are typical for only one story. It is worth mentioning that we had quite the
head start, since we have an intuition of things like a “common expression” and
an algorithm does not. Introducing a frequency threshold would lead to a sum-
mary far longer than the original data, as well as many unwanted redundancies.


This effect stems from the fact that every part of a frequent pattern is frequent,
and every pattern has exponentially many parts; this is also called the “pattern
explosion”. Nobody wants a summary that is longer than the object that was
originally summarized, and at this point, it would be shorter and cheaper to
just read the fairy tale itself and be done with it. To avoid the pitfall of the
pattern explosion, we will make use of the Minimum Description Length (MDL)
principle to guide us to small, succinct, and non-redundant descriptions. The
general idea is that if there is regularity in the data, anything other than a truly
random sequence of symbols, then we can make use of this to compress our data.
Regularity is equated with information.
There exist algorithms, such as Sqs [15] and ISM [5], to discover good sum-
maries of a given single event sequence database. There does exist an algorithm
for jointly summarising multiple databases, called DiffNorm, but as it only
considers transaction data it cannot tell the difference between the wolf eats
grandmother and grandmother eats the wolf. In this paper, we draw inspiration
from both Sqs and DiffNorm, and propose SqsNorm, to efficiently discover
meaningful pattern sets that jointly describe the difference and the norm between
multiple sequence databases.
    This paper is structured as usual. After quickly covering notation and in-
troducing the main concepts, we will specify how to encode data and model
for our system. We proceed to discuss finding good pattern candidates, and fi-
nally propose our algorithm SqsNorm. Through extensive evaluation we show
that SqsNorm discovers good results on both synthetic and real world data,
including some interesting results found when analyzing the inaugural addresses
of American presidents (arguably a close enough alternative to the fairy tale
example) and some of the works of J.R.R. Tolkien.


2     Preliminaries

In this section we introduce the notation we will use throughout this paper,
give a brief primer to the Minimum Description Length principle, and a short
introduction to the Sqs algorithm.


2.1   Basic Notation

The data we are working with is a bag D of d = |D| databases. A single database
D ∈ D consists of |D| sequences S, and a sequence S consists of |S| consecutive
events e drawn from an alphabet I. We write ||D|| for the number of all events in
D and naturally extend this to denote the number of all events in all databases
as ||D||. Let supp(e | S) denote the number of times an event occurs in a sequence.
We refer to this as the support of an event in a sequence, and accordingly define
supp(e | D) = Σ_{S∈D} supp(e | S) as the support of an event in a database, and
supp(e | D) = Σ_{D∈D} supp(e | D) as the support of an event in all databases. A
pattern X is a sequence of k = |X| events e ∈ I. A pattern X occurs in a
sequence S if some subsequence of S is equal to X. Note that we allow gaps of
(consecutive) singleton events between the events of a pattern. We call a tuple
(i, j, X) a window for pattern X in a sequence. Indexes i and j specify start
and end of the window. A window w is minimal if no other subsequence of w
contains X.
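To make these definitions concrete, the following sketch (our own illustration in Python, not code from the paper; all function and variable names are ours) computes the support of an event and enumerates the minimal windows of a pattern in a sequence, allowing gap events between the pattern events.

from collections import Counter

def supp(e, database):
    # support of event e: total number of occurrences over all sequences
    return sum(Counter(S)[e] for S in database)

def minimal_windows(X, S):
    """Yield minimal windows (i, j) of pattern X in sequence S: index ranges
    that contain the events of X in order (gaps allowed) and that contain no
    proper sub-window with the same property."""
    prev_start = -1
    for j in range(len(X) - 1, len(S)):
        if S[j] != X[-1]:
            continue
        # latest possible start of a window ending at j: match X backwards
        pos, ok = j, True
        for x in reversed(X[:-1]):
            pos -= 1
            while pos >= 0 and S[pos] != x:
                pos -= 1
            if pos < 0:
                ok = False
                break
        if ok and pos > prev_start:   # otherwise a shorter window exists
            yield (pos, j)
            prev_start = pos

S = list("abdcadbaabc")
print(supp("a", [S]))                          # 4
print(list(minimal_windows(list("abc"), S)))   # [(0, 3), (8, 10)]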


2.2     The MDL principle

The Minimum Description Length (MDL) Principle is a guide to finding struc-
ture in data. It was first introduced by Rissanen [11] and attempts to achieve an
incarnation of the (incomputable) Kolmogorov complexity [8]. The basic idea is
that information corresponds to compression. The compressors used are called
models, and after choosing a model class M for a problem setting, the models
M ∈ M can be compared with respect to their ability to compress. The better a
model compresses our data, the more information it incorporates. The principle
tells us that if we want to find the best model M for some data D, we should
pick the one that minimizes the sum

                               L(M ) + L(D|M )     .

The first term describes the number of bits used for the encoded model, while
L(D|M ) represents the number of bits required to encode the data using the
information gained from M . Both terms should ideally keep each other in the
balance, preventing overly complex, but expensive models, as well as cheap, but
uninformative ones. Note that we require the compression to be lossless in order
to fairly compare models of the same model class. Moreover, as in MDL we are
only interested in theoretical complexity, we only consider (optimal)
code lengths, not actual instantiated code words.
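As a minimal illustration of the principle (ours, not part of the paper; the scoring routine encoded_lengths is a hypothetical placeholder), MDL-based model selection amounts to minimising the two-part sum over the model class.

def best_model(candidates, data, encoded_lengths):
    # encoded_lengths(M, D) is assumed to return the pair (L(M), L(D | M)) in bits;
    # the MDL-optimal model is the one minimising their sum
    return min(candidates, key=lambda M: sum(encoded_lengths(M, data)))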


2.3     Summarising a Single Database

Tatti and Vreeken [15] recently proposed the Sqs algorithm (short for Summaris-
ing event seQuenceS) to efficiently discover good summaries of a single event se-
quence database, using the MDL principle to identify the optimal model. As we
will build upon and extend the Sqs algorithm, we will give a short introduction
to its main aspects below.
    The models Sqs uses are called code tables (CT ), which are lookup tables
between patterns, and their associated codes. To ensure lossless encoding of any
data over I, all singleton events are always included.
    The data is encoded as two separate streams Cp (pattern stream) and Cg (gap
stream). The first one contains the encoded data, represented as a sequence of
codes from the second column of the code table. The second stream is needed
since we have no information on potential gaps when reading a pattern code. It
is composed of fill-codes and gap-codes, and while decoding the data we read it
in parallel to the pattern stream to fill in the gaps. For more detailed information
about Sqs we refer to the original paper [15].
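To illustrate the bookkeeping a code table involves, the sketch below (our own representation, not the Sqs implementation) stores, per pattern, the usage, gap, and fill counts from which the pattern, gap, and fill codes are later derived; singletons are inserted up front so any data over I can be encoded.

from dataclasses import dataclass, field

@dataclass
class Entry:
    usage: int = 0   # how often the pattern code is used in the encoding
    gaps: int = 0    # number of gap codes emitted for this pattern
    fills: int = 0   # number of fill (non-gap) codes emitted for this pattern

@dataclass
class CodeTable:
    # maps a pattern (a tuple of events) to its usage statistics
    entries: dict = field(default_factory=dict)

    @classmethod
    def singletons_only(cls, alphabet):
        # all singleton events are always included, ensuring lossless encoding
        return cls({(e,): Entry() for e in alphabet})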
     Fig. 1: Toy example of a possible encoding of the data D = a b d c a d b a a b c,
     using the patterns abc (code p) and da (code q): the code table CT with pattern,
     gap, and non-gap codes, the pattern stream Cp, the gap stream Cg, and the
     resulting alignment.


Encoding the data For a pattern X in the code table, let us write usg(X) to
denote how often the code of this entry is used in the encoding. Intuitively, the
more often a code is used, the shorter it should be.
    Shannon entropy [4] tells us that the shortest prefix code of a pattern cor-
responds to the negative logarithm of its probability, which in our case is its
relative occurrence. We define the length of a pattern-code as follows:
                 L(code_p(X | CT)) = − log( usg(X) / Σ_{Y∈CT} usg(Y) ) .

    The length of the pattern stream Cp can now be defined intuitively. Since
the complete data has to be covered with codes of either patterns or single-
ton events, the cost of a single code is added each time this code is used:

                 L(Cp | CT) = Σ_{X∈CT} usg(X) · L(code_p(X)) .

The same strategy can be applied for Cg ; obviously, the gap code of a pattern
should become shorter if gaps in this pattern are likely.
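Both quantities follow directly from the usage counts; the sketch below (ours, with hypothetical names) derives the Shannon-optimal pattern-code lengths and the resulting length of the pattern stream Cp.

from math import log2

def pattern_code_lengths(usages):
    # L(code_p(X | CT)) = -log( usg(X) / sum_Y usg(Y) ) for every used entry
    total = sum(usages.values())
    return {X: -log2(u / total) for X, u in usages.items() if u > 0}

def pattern_stream_length(usages):
    # L(Cp | CT): every use of a code pays that code's length
    lengths = pattern_code_lengths(usages)
    return sum(usages[X] * lengths[X] for X in lengths)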
    The overall encoded length of a database D given a code table CT is then

        L(D | CT) = LN(|D|) + Σ_{S∈D} LN(|S|) + L(Cp | CT) + L(Cg | CT) ,

where we first encode the number of sequences in the database, as well as the
length of each sequence, using the MDL optimal universal code for integers
[12], which is defined for n ≥ 1 as LN(n) = log*(n) + log(c0), where log*(n) =
log(n) + log log(n) + . . . . To make it a valid encoding, i.e. to satisfy the Kraft
inequality, we set c0 to 2.865064.
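A possible implementation of LN under the definition above (a sketch; not taken from the Sqs code) is the following.

from math import log2

LOG2_C0 = log2(2.865064)   # normalisation constant, to satisfy the Kraft inequality

def L_N(n):
    # universal code length in bits for an integer n >= 1:
    # L_N(n) = log*(n) + log(c0), with log*(n) = log(n) + log log(n) + ...
    bits, x = LOG2_C0, log2(n)
    while x > 0:
        bits += x
        x = log2(x)
    return bits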

Finding the optimal alignment To encode a sequence database we need an
alignment of which codes are used when. In particular, we want that alignment
that encodes our data most succinctly. Whether a pattern should be used de-
pends on its code length, however, which in turn depends on how often we use
it. There are exponentially many alignments for a sequence database given a
set of patterns, and hence exhaustive search is not possible. Tatti and Vreeken
therefore propose a heuristic solution, which we call SqsAlign for the remain-
der of this paper. SqsAlign exploits the fact that fixing an alignment leads to
computable code lengths and vice versa. In particular, given a set of patterns,
and initializing the usages to the supports of the patterns, SqsAlign alternates
between finding the most succinct alignment given the code lengths these usages
determine, and updating the code lengths accordingly, until convergence. As the
starting values for the usages are estimates, the algorithm will only find a local
minimum, yet experiments show that in practice this allows for the discovery
of good models. For more details about the Sqs aligning scheme we refer the
reader to Algorithm 2 in the original paper [15].
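The alternation can be summarised as follows. This is our high-level paraphrase: the actual Sqs cover routine, usage counting, and scoring are abstracted away as the placeholder arguments best_alignment, usages_of, and total_length.

def iterate_alignment(data, init_usages, best_alignment, usages_of, total_length):
    """Alternate between (1) fixing code lengths via the current usages and
    (2) recomputing the most succinct alignment under those code lengths,
    until the encoded length no longer improves (a local minimum)."""
    usages = dict(init_usages)                 # e.g. initialised to the supports
    alignment = best_alignment(data, usages)
    best = total_length(data, alignment)
    while True:
        usages = usages_of(alignment)              # usages determine code lengths
        candidate = best_alignment(data, usages)   # re-align under the new lengths
        length = total_length(data, candidate)
        if length >= best:
            return alignment
        alignment, best = candidate, length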

Mining good models This leaves us with the question of how to discover
good models. Sqs does so in a bottom-up fashion, starting with the singleton
only code table ST , and then iteratively searching for that candidate pattern
X that maximally improves the compression, until we can find no candidate
that improves the score, at which point we halt. As candidates, it can either consider a
given pre-mined collection of serial episodes, or mine good candidates on the fly
by considering combinations PQ of patterns P, Q ∈ CT already in the code
table. Evaluating the quality of every candidate is expensive, and therefore Sqs
rather first estimates the quality of a candidate, and evaluates the candidates in
estimated order of gain in compression.

Limitations of Sqs for our purpose Although Sqs discovers good summaries
for single databases, it does not suffice to jointly summarize multiple databases;
when we run it on the concatenation of all databases, it will likely forego local
detail, whereas when we run it on each of the individual databases it will not
be able to pick up locally infrequent yet globally important patterns. Ignoring
this, even concatenating these models will be a bad summary of the whole;
as Sqs is a heuristic it may be that two very similar databases might lead to
similar scores, yet very dissimilar pattern sets, and hence their concatenation
could produce yet another different result. This would take away our basis of
reasoning: instead of finding one set of patterns that describes well the entire set
of databases, we would end up with multiple (possibly unrelated) pattern sets
that each compress some subset of the data.
    Our approach is hence to instead find patterns from a unified set of candi-
dates, assuring consistent and comparable results.

3   Theory
Given the preliminaries above, we can now proceed to formalize our problem. To
this end we will first specify our model class M. A single code table, like for Sqs,
will not suffice in the setting of multiple databases. In our case, a model M ∈ M
will be a set S of pattern sets Sj , each Sj characterizing one of the database
subsets the user is interested in. We will use the original Sqs encoding to obtain
L(Di | CT i ), which means that the receiver must be able to reconstruct code
tables CT1 , . . . , CTd from model M to decode each database.
    Let D be a set of d sequence databases over alphabet I. Furthermore, let J
= {1, . . . , d} be a set of indexes. Any index set j ⊂ J can then be identified with
a set of databases {Di ∈ D | i ∈ j}. On receiving a set U of index sets as input,
our goal is to find a model M that compresses D. Specifically, for each of the
j ∈ U we want to find a set of patterns Sj that succinctly describes the subset of
databases j is associated with.

Encoded length of the model
We now proceed to describe all the information necessary to encode the pattern
sets Sj and the additional information needed to build a code table CTi for each
Di . To encode a pattern set Sj , we first encode the number of patterns it contains
using the universal code for integers introduced in the previous section. For each
pattern in Sj we encode its length as well as the events of the pattern. Patterns
are encoded like in Sqs, with the singleton-only code table over the concatenated
databases, ST⊗ , that assigns optimal Shannon codes to all singletons according
to their global support in D.

         L(Sj) = LN(|Sj| + 1) + Σ_{X∈Sj} ( LN(|X|) + Σ_{x∈X} L(code_p(x | ST⊗)) )

The final cost for all the pattern sets then amounts to the sum of the costs of
all the individual pattern sets.
    With this information it is already possible to obtain the first column of the
code tables CTi . Let πi(S) = {Sj ∈ S | i ∈ j} denote all pattern sets that
contain relevant non-singleton patterns for a database Di , and Pi = ∪ πi(S) the
set of patterns needed to encode Di . The entries of the first column of a code
table CT i then consist of the patterns in Pi , together with the singletons to
ensure lossless encoding.
    We still need to encode the information for the pattern, gap, and fill codes per
database to make this scheme work. Let usg(Pi ) denote the sum of all usages
of patterns in Pi . The length of the code columns Ci for code table CT i for
database Di is then defined as
               L(Ci) = LN(||Di||) + log binom(||Di|| + |I| − 1, |I| − 1)
                      + LN(usg(Pi) + 1) + log binom(usg(Pi) + |Pi| − 1, |Pi| − 1)
                      + Σ_{X∈Pi} LN(gaps(X) + 1) ,

where the first two terms encode the support of each singleton in a database.
This is done using a weak number composition. The second part uses the same
technique to encode the usages for each pattern. Lastly, we encode the number
of gaps for each pattern to obtain the gap codes. The fill code of a pattern can
be computed from its length and usage, both of which we know at this point. Having
defined the encoding for both the pattern sets and the code tables, the final cost
for our model can now be computed as
                          L(M) = LN(|I|) + L(S) + Σ_{i∈J} L(Ci) .

   For some database Di , it is now trivially possible to extract the corresponding
code table CTi from our model. For the leftmost column, we add the singletons
as well as the patterns in Pi which we know from S. Lastly, we fill in the codes
with the information Ci provides.
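Putting the pieces together, the sketch below (our illustration; all helper names are hypothetical) computes L(Sj), L(Ci), and hence L(M), using the logarithm of a binomial coefficient for the two weak-composition terms.

from math import lgamma, log, log2

def L_N(n):
    # universal integer code length, as in the earlier sketch
    bits, x = log2(2.865064), log2(n)
    while x > 0:
        bits, x = bits + x, log2(x)
    return bits

def log2_binom(n, k):
    # log2 of the binomial coefficient C(n, k), via the log-gamma function
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(2)

def L_pattern_set(Sj, singleton_code_len):
    # L(Sj): number of patterns, then per pattern its length and its events,
    # the latter encoded with the Shannon codes of the global table ST⊗
    bits = L_N(len(Sj) + 1)
    for X in Sj:
        bits += L_N(len(X)) + sum(singleton_code_len[x] for x in X)
    return bits

def L_code_columns(n_events, alphabet_size, usg_Pi, n_patterns, gaps_per_pattern):
    # L(Ci): singleton supports and pattern usages as weak compositions,
    # plus the number of gaps per pattern (assumes n_patterns >= 1)
    bits = L_N(n_events) + log2_binom(n_events + alphabet_size - 1,
                                      alphabet_size - 1)
    bits += L_N(usg_Pi + 1) + log2_binom(usg_Pi + n_patterns - 1,
                                         n_patterns - 1)
    bits += sum(L_N(g + 1) for g in gaps_per_pattern)
    return bits

def L_model(alphabet_size, pattern_sets, singleton_code_len, per_db_args):
    # L(M) = LN(|I|) + L(S) + sum_i L(Ci)
    bits = L_N(alphabet_size)
    bits += sum(L_pattern_set(Sj, singleton_code_len) for Sj in pattern_sets)
    bits += sum(L_code_columns(*args) for args in per_db_args)
    return bits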

Encoded length of the data
Given a code table CT i for a database Di , we can simply re-use the encoding
scheme from Sqs. The pattern codes code_p(X | CTi) and gap codes code_gap(X | CTi)
and code_fill(X | CTi) for a database Di correspond to the respective pattern X
in CTi . Each database is encoded as two streams Cp and Cg , with

    L(Cp | CTi) = Σ_{X∈CTi} usg(X) · L(code_p(X | CTi)) ,   and

    L(Cg | CTi) = Σ_{X∈CTi, |X|>1} ( gaps(X) · L(code_gap(X | CTi)) + fills(X) · L(code_fill(X | CTi)) ) .

The cost for a database Di then amounts to

    L(Di | CTi) = LN(|Di|) + Σ_{S∈Di} LN(|S|) + L(Cp | CTi) + L(Cg | CTi) .

L(D) is then obtained by adding up the codes for each Di ∈ D. This score
enables us to evaluate how well different pattern sets compress our data, and
whether a candidate should be added to the model.
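For completeness, a corresponding sketch of the data cost (again our own illustration; the per-entry statistics and code lengths are assumed to be given, e.g. as produced by the alignment, and L_N is the universal code length from the earlier sketch).

def L_databases(databases, code_tables, L_N):
    """L(D) = sum_i L(Di | CTi). Each code table maps a pattern (a tuple of
    events) to a dict with its 'usage', 'gaps', 'fills' counts and the code
    lengths 'code_len', 'gap_len', 'fill_len' in bits."""
    total = 0.0
    for D, CT in zip(databases, code_tables):
        bits = L_N(len(D)) + sum(L_N(len(S)) for S in D)
        for X, st in CT.items():
            bits += st['usage'] * st['code_len']                    # L(Cp | CTi)
            if len(X) > 1:                                          # L(Cg | CTi)
                bits += st['gaps'] * st['gap_len'] + st['fills'] * st['fill_len']
        total += bits
    return total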

Problem definition
Having formalized our score, we can now formally state the problem we consider.
Minimal Serial Episode Composition Problem Let I be a finite alphabet
of events, D a bag of sequence databases over I, and U a set of index sets over
D. Further, let F be the set of all possible sequences over I of length at most
max_{S∈D} |S|. Find that set S ∈ P(F)^{|U|} of sets of serial episodes that minimizes
                                     L(S) + L(D | S) .
 Regarding the complexity of this problem, we note that summarising a single
sequence database is already hard: the search space is exponential, and there
is no easily exploitable structure [15]. Unfortunately, our problem setting does
not make it easier. We now have to consider all possible episodes in D, and
all possible ways to divide them up into |U | many pattern sets. Exhaustively
searching for the optimal model is not feasible, and there is again no exploitable
structure in the search space. We therefore propose an algorithm that finds good
pattern sets on a heuristic basis.
4     Algorithm
In the previous section we introduced MDL scores for model and data and defined
the problem we are facing. Having seen that an exact computation of the optimal
model is not feasible, we propose SqsNorm, a heuristic solution that allows us
to mine good models directly from data.

Estimating candidates We denote the number of bits needed to encode
database Di with code table CTi as L(Di, CTi). Let us write ∆L(Di, CTi ⊕ X)
for the gain of adding a pattern X to the code table of a database Di ,

                 ∆L(Di, CTi ⊕ X) = L(Di, CTi) − L(Di, CTi ⊕ X) .

To compute this difference we use the Sqs encoding and compare the scores. We
have seen in the previous sections that Sqs estimates the gain of a candidate
and chooses the best extension PX for a pattern P based on this estimate.
    We propose a slightly altered version of the estimate algorithm used in Sqs.
Instead of returning only the extension with the highest gain dX , it returns the set
of all possible extensions for P together with their scores. Its outline is shown in
Algorithm 1. This tweak allows us to compare scores of candidates over several
databases.

 Algorithm 1: Estimates(D, A, P)
   input : Database Di , current alignment Ai , pattern P ∈ CTi
   output: All possible candidates PX and their estimated gain dX
 1 for windows v of P in Ai do
 2    for windows w of X that occur after v do
 3        dX ← updateGain(P, X);
 4 return {(X, dX) | X ∈ CTi};

    Let us denote the estimated gain achieved by adding a pattern P somewhere
in our model as ∆̂L(D, S ⊕ P). Our goal is then to find candidates with a high
∆̂L(D, S ⊕ P).
    For any candidate pattern PX, where P and X are drawn from P ∪ I, we
define a function ∆̂L(Di, CTi ⊕ PX) to describe the estimated gain when PX
is added to the code table CTi for Di . This is not equivalent to dX , because dX
is only defined for entries P and X that have at least one active window in Di .
We set this new score to zero for patterns that either are not relevant for the
given database, or are expected to have a negative gain:

    ∆̂L(Di, CTi ⊕ PX) = dX   if (X, dX) ∈ Estimates(Di, Ai, P) ∧ dX ≥ 0,
                        0    otherwise.
Algorithm 2 gives an outline of how to find a good extension X for a given
pattern P in the multiple database setting.


 Algorithm 2: GetCandidate(D, A, P)
   input : Bag of d databases D, set A of alignments A1 , . . . , Ad , P ∈ P
   output: Pattern PX with X ∈ P and high ∆̂L(D, S ⊕ PX)
 1 for Di ∈ D do
 2    if P ∈ CTi then
 3        candsi ← Estimates(Di, Ai, P)
 4 for X ∈ P do
 5    gainX ← Σ_i ∆̂L(Di, CTi ⊕ PX)
 6 return PX with highest gainX ;



    For simplicity, we assume that a pattern will be added to all databases where
it has a positive gain dX , even though an index set representing these databases
might not even exist in U . The task of choosing the right Sj for a pattern is
addressed after the estimating process.


The SqsNorm Algorithm Since we can now create a list of candidates for our
model, we proceed by computing their actual value. As this requires us to call
SqsAlign for each candidate and database, we only want to do an expensive
computation if it is possible for the candidate to exist. Consider the following
indicator function φ:

                   φ(Di, P) = 1 if ∃ P1, P2 ∈ CTi : P = P1 P2,
                              0 otherwise.
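In code, this check amounts to testing whether the candidate splits into two entries of the code table; a minimal sketch (ours; patterns represented as tuples of events):

def can_be_composed(CT_i, P):
    # phi(Di, P): 1 iff P is the concatenation of two patterns P1, P2 that are
    # both already entries of the code table CT_i
    entries = set(CT_i)
    return int(any(P[:k] in entries and P[k:] in entries
                   for k in range(1, len(P))))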

    We give the pseudo-code of SqsNorm as Algorithm 3. We first get the pre-
liminary alignments by using SqsAlign without any patterns (line 1), and then
start by creating a batch of candidates (3) that we order by their estimated gain.
For each database in which a candidate may occur (5–6), we compute the code
lengths to obtain their actual gain (7). The SetCover section is inherited from
DiffNorm [2]: The set w contains the gain of a candidate PX for each index
set j as a weight (10), and we choose to add PX to those Sj that maximize
the gain (11–12). Since adding a new pattern to a code table might change the
usages of other patterns, or even make them redundant (imagine big bad becom-
ing dispensable as soon as big bad wolf has been found), each candidate check is
followed by a pruning step (13), in which we re-evaluate all patterns in the model
one by one, permanently removing them if this gives a gain in compression.
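The set-cover step can be pictured with a simple greedy sketch (ours; the WeightedSetCover routine inherited from DiffNorm may differ in its details): cover the databases in which the candidate gains bits, preferring index sets with a large total gain per newly covered database.

def greedy_weighted_cover(gains, index_sets):
    # gains: database id -> gain of the candidate in that database (in bits)
    # index_sets: the user-chosen index sets, each a collection of database ids
    to_cover = {i for i, g in gains.items() if g > 0}
    chosen = []
    while to_cover:
        best, best_score = None, 0.0
        for j in index_sets:
            new = to_cover & set(j)
            if not new:
                continue
            score = sum(gains[i] for i in new) / len(new)
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        chosen.append(best)
        to_cover -= set(best)
    return chosen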


 Algorithm 3: SqsNorm(D, U)
   input : Bag of d databases D, set of index sets U
   output: Model S = {S1 , . . . , S|U|} that compresses D well
 1 P ← ∅; S ← {∅ | j ∈ U}; A ← {Sqs(Di, ∅) | Di ∈ D}
 2 do
 3    F ← {GetCandidate(D, A, P) | P ∈ P ∪ I}
 4    for PX ∈ F ordered by gainX do
 5       for Di ∈ D do
 6          if φ(Di, PX) then
 7             gaini ← ∆L(Di, CTi ⊕ PX)
 8          else
 9             gaini ← 0
10       w ← {max(0, Σ_{i∈j} gaini) | j ∈ U}
11       U′ ← WeightedSetCover(J, U, w)
12       S′ ← S with PX added to every Sj with j ∈ U′ and positive wj
13       S ← Prune(D, S, S′)
14 while L(D, S) decreases
15 return S

Complexity Last, we consider the computational complexity of SqsNorm.
Since Estimates takes O(|Pi| + |I| + ||Di||) steps [15], GetCandidate has
a complexity of O(d × (|Pi| + |I| + ||Di||)). The overall number of patterns
in S can be bounded by the number |F| of frequent serial episodes over I.
Running Sqs for one database Di takes O(|Pi| × ||Di||) [15], and pruning takes
O(|S|² × |D|) time [2]. All in all, SqsNorm executes GetCandidate, SqsAlign,
and Prune at most |F| times, which results in a worst-case complexity of
O(|F| × ((|F| × ||D||) + (|F| + ||D|| + d × |I|) + (|F|² × |D|))) < O(|F|³ × ||D||).
In practice, SqsNorm converges quickly, evaluating only a few carefully chosen
candidates, calling SqsAlign only where it makes sense to do so, and pruning
the model so that it contains only relevant patterns at all times; in our
experiments SqsNorm takes up to a few minutes on reasonably sized data.


5     Related Work

Discovering interesting patterns from sequential data has a long history. Tra-
ditional approaches focus on mining all patterns that satisfy a local intererest-
ingness constraint. In sequential data, however, there exist multiple reasonable
definitions for support. Laxman et al. proposed to count the largest set of non-
overlapping windows using finite state automata, tracking the frequencies of
each episode with an automaton [7]. Their method discovers serial and parallel
episodes. Mannila et al. propose counting minimal windows, starting with sin-
gletons and building episodes of increasing length [9]. The algorithm relies on
the fact that a frequent episode implies that its sub-episodes are frequent and
requires window size and minimal support as user parameters. Tatti et al. intro-
duce strict episodes and define a subset relation between them, using directed
acyclic graphs to mine pattern candidates [14]. This approach also generates
candidate episodes from already discovered patterns.
    A related class of methods aims to discover those patterns that are prevalent
in one class, and rare in the other [6], or to mine such discriminative patterns
directly from data for classification purposes [3]. Quiniou et al. consider multiple
text corpora with the goal of discovering typical linguistic patterns for each of
them [10]. Our objective, on the other hand, is not to discriminate between
databases nor to describe them independently, but to jointly characterize sets of
databases the user is interested in.
    In this paper we do not consider mining all patterns, but rather return a
small set of patterns that together describe the given databases well. As far as
pattern set mining is concerned, Krimp [16] made the first step toward using
MDL in market basket analysis, selecting a small set of patterns that compress
a single database well out of a possibly huge set of frequent patterns. Slim [13]
later refined this approach, mining candidates directly from the data instead of
depending on an input set of candidates.
    In this paper we build upon Sqs [15], an MDL-based heuristic to mine de-
scriptive sets of serial episodes for a single sequence database. To encode the
data, it uses look-up tables for patterns and codes, and considers candidates
either from a pre-mined pool, or by iteratively considering combinations of pat-
terns in the current model. While this approach works well for a single dataset,
it cannot trivially be extended for multiple databases, since the heuristic nature
of the algorithm would not produce consistent candidates for each database.
    In a more recent approach, Squish enriches the concept of Sqs with a broader
pattern language [1]. In addition to serial episodes, Squish allows for choice
episodes, where a position in an episode may be filled with one of a predefined
choice of events. In addition, Squish adopts a greedy heuristic to find good
covers of the data. It will be interesting to see to what extent these ideas can be
carried over into SqsNorm.
    DiffNorm [2] is the other inspiration of this work. Its goal is to find differ-
ence and norm among itemset databases with the help of MDL, which we here
extend to the setting of sequential data; unlike SqsNorm, DiffNorm does not
consider order, repetitions and possible gaps, and as a result neither its score
nor search algorithm can be trivially applied, or adapted to sequential data.

6     Experiments
Next we proceed to empirically evaluate SqsNorm. To this end we consider
both synthetic and real world data. As real world data, for interpretability of
the results, we will consider text data. All experiments were run single-threaded
on a quad-core Intel machine with 8GB of memory, running Windows.

6.1     Synthetic data
All synthetic datasets feature 10, 000 events per database. Events are drawn
from alphabet I = {0, . . . , 500}. Index set U includes pattern sets S1 , . . . , Sd for
each individual database as well as a global pattern set S⊗ . In the following we
describe four basic datasets we used to ascertain the most important properties
of SqsNorm.

Dataset    #L   #G   L(D, S0)   L%   |S|    =    ∪    ?   time(s)
Random      0    0    461 180   100    0    0    0    0       34
Local      50    0    460 612    97   50   50    0    0      161
Global      0   10    460 540    97   10   10    0    0      111
Mixture    10   90    906 089    71  100  100    0    0      765
Table 1: Results on synthetic datasets. From left to right: number of locally (#L)
and globally (#G) planted patterns, length using the empty model (L(D, S0)),
compression rate (L% = L(D, S)/L(D, S0)), total number of patterns (|S|), exactly
recovered patterns assigned to the correct Sj (=), concatenations of patterns (∪),
patterns that are unrelated to any planted patterns (?).
Random To make sure our algorithm does not pick up noise, we create the
Random dataset, consisting of five databases. It contains no information, since
events are picked uniformly at random.
Local For this dataset, we set up five databases and plant ten unique patterns
per database (all in all 50 unique patterns: (1, . . . , 5) to (246, . . . , 250)). In this
case we would like to find all the planted patterns in their respective Sj , while
S⊗ has to stay empty.
Global We then proceed by planting 10 global patterns in five databases. Pat-
terns consist of randomly chosen events from I, which allows for arbitrary pat-
terns that might have repetitions or shared events. This time, the patterns should
be identified as typical for all databases and put in S⊗ .
Mixture In the Mixture set we run SqsNorm on a set of 10 databases and
plant a total of 100 patterns, 90 of which are global, while each database retains
exactly one local pattern. Patterns are created exactly as for the global set.
Table 1 shows results for each of the introduced datasets. SqsNorm correctly
identifies the planted patterns and assigns them to the pattern set that fits them
without picking up noise. These results do not change even if gaps occur with a
probability of up to 50%, which doubles the average pattern cover.
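The paper does not list a generator, but data of this kind can be produced along the following lines (a sketch; the number of occurrences per planted pattern and the gap mechanism are our own illustrative choices).

import random

def plant(sequence, pattern, alphabet, gap_prob=0.0):
    # insert one occurrence of `pattern` at a random position, optionally
    # interleaving a random gap event after each pattern event
    occ = []
    for e in pattern:
        occ.append(e)
        if random.random() < gap_prob:
            occ.append(random.choice(alphabet))
    pos = random.randrange(len(sequence) + 1)
    sequence[pos:pos] = occ

def make_database(patterns, n_events=10_000, alphabet=range(501),
                  occurrences=50, gap_prob=0.0):
    # one synthetic database: uniform random events plus planted patterns,
    # returned as a database holding a single sequence
    alphabet = list(alphabet)
    seq = [random.choice(alphabet) for _ in range(n_events)]
    for X in patterns:
        for _ in range(occurrences):
            plant(seq, X, alphabet, gap_prob)
    return [seq]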


6.2   Real data

We now proceed to test SqsNorm on some real world datasets. We first explain
the data structure and preprocessing methods and then present some results.
Addresses per Era First, we regard the Addresses dataset. It consists of the
presidential inaugural addresses from George Washington to Donald Trump.
The individual words are stemmed and stopwords are removed. Each word in
the resulting corpus is seen as an event, and we group the addresses into eight
eras of American history, regarding each of these eras as a database and the
individual addresses as sequences. Figure 2 shows some results for each era and
S⊗ .

Dataset          |D|    |I|    ||D||   |J|   |S⊗|   avg(|Sj|)
Addresses/Era     67   5386    64035     8     14       21.5
Addresses         67   5386    64035    57      4       1.37
LOTR1-12          12   4060    44347    12      8          6
Hobbit            19   4021    45650    19      7       2.74
Songs              4     26     3223     4     13       7.75
Table 2: Statistics about the datasets used. From left to right: number of se-
quences, size of the alphabet, overall number of events, number of databases,
size of S⊗ , the average number of patterns representing an individual database.
Addresses individual For comparison, we then treat each inaugural address
as a database and run SqsNorm again. We would expect S⊗ to shrink, since
57 individual inaugural addresses probably have fewer patterns in common than
entire eras from the previous experiment. Also, it would be natural for the indi-
vidual Sj to become smaller, since a single address is not likely to have more than
two or three recurring slogans. The experiments confirm this effect. S⊗ shrinks,
and the average number of patterns per individual address goes down to about
1. This often corresponds to one special catchphrase or slogan used during the
address. Interestingly, these slogans were often not found in the Era dataset,
because it is too costly to call such a pattern era-typical when it only occurs
during one speech. Some patterns that were only found on an address-wise run
of SqsNorm are deliver up (Lincoln), new spirit (Carter), put out [my] hand
(Bush) and make America [. . . ] again (Trump).
LOTR1-12 To obtain more meaningful results, we analyze the first twelve chap-
ters of J.R.R. Tolkien’s Lord of the Rings: The Fellowship of the Ring. The text is
stemmed and stopwords are removed. We regard individual chapters as sequence
databases containing a single sequence. Quite correctly, SqsNorm finds general
themes like “Mr Bilbo” or “Sam Gamgee” in S⊗ . Characteristics of each chapter
include chapter-bound events like Bilbo’s eleventy-first birthday, and new con-
cepts that are introduced and explained during a chapter such as the one ring.
Figure 3 shows patterns found for selected chapters.
The Hobbit We then proceed to explore J.R.R. Tolkien’s book The Hobbit in
the same way. Again, the patterns found conform to our expectations. Local
patterns correspond to specific scenes or places (dining room, birthday present,
lake town), while the main characters’ names and concepts that weave through
the entire story (King under the Mountain, Mr Baggins, Fili Kili, . . . ) are found
in S⊗ .

Fig. 2: Results of the Addresses dataset when divided by era (from 1789 to today).
Left: patterns clustered by the era they are typical for, e.g. Inland Frontier, Foreign
Nation; Civil Servic[e], Social Order; Interstate Commerc[e], Race Feel; Eighteenth
Amend[ment], Western Hemisphere; Nuclear Weapon, Both Sides; 21st Centur[y],
Health Care. Right: typical patterns for all eras: fellow citizen, arm[y] nav[y],
american peopl[e], unit[ed] state[s], men women, almight[y] god.


Songs To evaluate SqsNorm on a somewhat different scale, we finally test it
on a collection of four songs taken from the top of the charts around Christmas.
Each song is understood as a database containing a single sequence. Instead of
comparing them word by word, we choose to analyze them letter by letter, with
I = {a, . . . , z} as our alphabet. This results on the one hand in global stopwords,
like and and this, which do not contribute much interesting information. We find
one non-stopword pattern in S⊗ : Love is a common pattern for all chosen songs.
On the other hand we find long pieces of refrains (up to 32 letters) in the local
pattern sets, often catching the most memorable bits of the song.


7       Discussion
The experiments show that SqsNorm correctly returns good sets of informative
patterns. By analyzing synthetic datasets we could show that true patterns are
correctly found up to very high gap levels. Furthermore, global patterns are
correctly identified as such while noise is not picked up at all. Results on real
datasets are interesting and plausible, and we discover universally used phrases
and themes alike. By experimenting with the Songs dataset we have seen that
SqsNorm can also be applied on a character-wise level. It picks up large parts of
the songs’ main refrains, as well as frequently used stopwords and even identifies
love as a common theme they share.

Fig. 3: Results of the LOTR1-12 dataset. Left: typical patterns for chosen chap-
ters, e.g. CH1: good bye, one hundred forty four, chubb bolger bracegirdl brockhous,
proudfoot, young hobbit, eleventy one; CH2: land [of] mordor, gladden field, one
ring; CH4: black rider, mr peregrin, golden perch, short cut; CH7: tom bombadil,
ring ding dillo, nightly noise, hill top; CH11: tin uviel, gil galad, hill top, answer[ed]
strider; CH12: house elrond. Right: typical expressions for all chapters: mr baggins,
long ago, brandy hall, mr bilbo, sam gamgee, bag end.


    Despite these very encouraging results, there are several possibilities to fur-
ther improve SqsNorm. Candidate estimation can be trivially parallelized by
concurrently evaluating candidates. Moreover, since estimates for each database
are calculated independently, all of these computations can be parallelized on a
database-level as well. Lastly, we are currently using prefix codes for our encod-
ing. There is a more refined approach called prequential coding, which allows
us to transmit the data without having to encode the usages of the patterns
explicitly, and hence helps avoid any possible bias in the encoding of the model.
The key idea of prequential coding is that, starting from a uniform distribution
over the usages, we can update these counts with every code that we receive, and
so rapidly approximate the true usage distribution. This has been successfully
used in DiffNorm [2], and while our results show it is not necessary to obtain
meaningful results, it will be interesting to see if these improve when we do.
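For intuition, a prequential (plug-in) code length can be computed as below (our sketch; the initial pseudo-count epsilon = 0.5 is a common choice, not a value taken from DiffNorm).

from math import log2

def prequential_length(code_stream, symbols, epsilon=0.5):
    # code length of a stream of pattern codes when sender and receiver both
    # start from uniform pseudo-counts and update them after every symbol,
    # so the usages never have to be transmitted explicitly
    counts = {s: epsilon for s in symbols}
    total = epsilon * len(symbols)
    bits = 0.0
    for s in code_stream:
        bits += -log2(counts[s] / total)
        counts[s] += 1.0
        total += 1.0
    return bits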


8    Conclusion
In this paper we studied the problem of discovering difference and norm between
sequence databases. In order to do so, we proposed SqsNorm, an algorithm that
uses MDL to find sets of patterns that represent both local and global features of
the data. Drawing inspiration from Sqs and DiffNorm, we defined an encoding
for our data and proposed a way to mine good candidates directly from data.
Evaluating SqsNorm, we have seen that our system works well in practice, does
not pick up noise, and is fast. SqsNorm allows us to extract potentially interest-
ing information with regard to what is the norm, and what are the differences
between sequence databases in general, and, as we have shown, text data in par-
ticular. In the future we plan to further explore its application for exploratory
analysis in computational linguistics.

Acknowledgements
The authors are supported by the Cluster of Excellence “Multimodal Computing
and Interaction” within the Excellence Initiative of the German Federal Gov-
ernment.

References
 1. A. Bhattacharyya and J. Vreeken. Efficiently summarising event sequences with
    rich interleaving patterns. In SDM. SIAM, 2017.
 2. K. Budhathoki and J. Vreeken. The difference and the norm – characterising
    similarities and differences between databases. In ECML PKDD. Springer, 2015.
 3. H. Cheng, X. Yan, J. Han, and P. S. Yu. Direct discriminative pattern mining for
    effective classification. In ICDE, pages 169–178, 2008.
 4. T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-
    Interscience New York, 2006.
 5. J. Fowkes and C. Sutton. A subsequence interleaving model for sequential pattern
    mining. In KDD, 2016.
 6. X. Ji, J. Bailey, and G. Dong. Mining minimal distinguishing subsequence patterns
    with gap constraints. Knowl. Inf. Syst., 11(3):259–286, April 2007.
 7. S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. A fast algorithm for finding
    frequent episodes in event streams. In KDD, pages 410–419. ACM, 2007.
 8. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applica-
    tions. Springer, 1993.
 9. H. Mannila, H. Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes
    in event sequences. Data Min. Knowl. Discov., 1(3):259–289, January 1997.
10. S. Quiniou, P. Cellier, T. Charnois, and D. Legallois. What about sequential data
    mining techniques to identify linguistic patterns for stylistics? In CICLing, pages
    166–177. Springer-Verlag, 2012.
11. J. Rissanen. Modeling by shortest data description. Automatica, 14(1):465–471,
    1978.
12. J. Rissanen. A universal prior for integers and estimation by minimum description
    length. Annals Stat., 11(2):416–431, 1983.
13. K. Smets and J. Vreeken. Slim: Directly mining descriptive patterns. In SDM,
    pages 236–247. SIAM, 2012.
14. N. Tatti and B. Cule. Mining closed episodes with simultaneous events. In KDD,
    pages 1172–1180, 2011.
15. N. Tatti and J. Vreeken. The long and the short of it: Summarizing event sequences
    with serial episodes. In KDD, pages 462–470. ACM, 2012.
16. J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: Mining itemsets that compress.
    Data Min. Knowl. Disc., 23(1):169–214, 2011.