<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Characterising the Difference and the Norm between Sequence Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frauke Hinrichs</string-name>
          <email>hinrichs@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jilles Vreeken</string-name>
          <email>jilles@mpi-inf.mpg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Max-Planck Institute for Informatics and Saarland University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>49</fpage>
      <lpage>64</lpage>
      <abstract>
<p>In pattern set mining we are after a small set of patterns that together are characteristic for the data at hand. In this paper we consider the problem of characterizing not one, but a set of sequence databases, such as a collection of articles or the chapters of a book. Our main objective is to find a set of patterns that captures the individual features of each database, while also finding shared characteristics of any subset of the data we are interested in. We formulate this problem in terms of MDL, and propose SqsNorm, an efficient algorithm to extract high quality models directly from data. We devise a heuristic to quickly find and evaluate candidates, as well as a model that takes shared and singular patterns into account. Experiments on synthetic data confirm that SqsNorm ably reconstructs the ground truth model. Experiments on text data, including a set of speeches, several book chapters, and a collection of songs, show that SqsNorm discovers informative, non-redundant and easily interpretable pattern sets that give clear insight into the data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>In this paper we propose SqsNorm, a method to characterise the difference and
norm between multiple sequence databases in terms of sequential patterns. To
make this less abstract, we will consider the case of analysing fairy tales. Assume
we pick a set of fairy tales and analyse it with the goal of finding expressions
that are typical for one or more of these stories. For example, most of them will
have a happy ending. Some might even employ the catchy “and they lived happily
ever after”. We are likely to find a healthy share of kings and queens, beautiful
daughters and evil stepmothers. We will also notice expressions (or patterns, if
you will) that have their grand entrance only once. Imagine candidates like Snow
White or the Seven Little Kids, who are both very prominent in their respective
tales but have no other appearances elsewhere.</p>
<p>The goal of SqsNorm is to discover exactly such a result. We want to find
expressions that characterize a chosen subset of fairy tales, as well as patterns
that are typical for only one story. It is worth mentioning that we had quite the
head start, since we have an intuition of things like a “common expression” and
an algorithm does not. Simply introducing a frequency threshold would lead to a
summary far longer than the original data, as well as many unwanted redundancies.
This effect stems from the fact that every part of a frequent pattern is frequent,
and every pattern has exponentially many parts; it is also called the “pattern
explosion”. Nobody wants a summary that is longer than the object that was
originally summarized, and at this point, it would be shorter and cheaper to
just read the fairy tale itself and be done with it. To avoid the pitfall of the
pattern explosion, we will make use of the Minimum Description Length (MDL)
principle to guide us to small, succinct, and non-redundant descriptions. The
general idea is that if there is regularity in the data, anything other than a truly
random sequence of symbols, then we can make use of this to compress our data:
regularity is equated with information.</p>
      <p>
        There exist algorithms, such as Sqs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and ism [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], to discover good
summaries of a given single event sequence database. There does exist an algorithm
for jointly summarising multiple databases, called DiffNorm, but as it only
considers transaction data it cannot tell the difference between the wolf eats
grandmother and grandmother eats the wolf. In this paper, we draw inspiration
from both Sqs and DiffNorm, and propose SqsNorm, to efficiently discover
meaningful pattern sets that jointly describe the difference and the norm between
multiple sequence databases.
      </p>
      <p>This paper is structured as usual. After quickly covering notation and
introducing the main concepts, we will specify how to encode data and model
for our system. We proceed to discuss finding good pattern candidates, and
finally propose our algorithm SqsNorm. Through extensive evaluation we show
that SqsNorm discovers good results on both synthetic and real world data,
including some interesting results found when analyzing the inaugural addresses
of American presidents (arguably a close enough alternative to the fairy tale
example) and some of the works of J.R.R. Tolkien.</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>In this section we introduce the notation we will use throughout this paper,
give a brief primer to the Minimum Description Length principle, and a short
introduction to the Sqs algorithm.</p>
      <sec id="sec-2-1">
        <title>Basic Notation</title>
<p>The data we are working with is a bag D of d = |D| databases. A single database
D ∈ D consists of |D| sequences S, and a sequence S consists of |S| consecutive
events e drawn from an alphabet I. We write ||D|| for the number of all events in
D and naturally extend this to denote the number of all events in all databases
as ||D||. Let supp(e | S) denote the number of times an event occurs in a sequence.
We refer to this as the support of an event in a sequence, and accordingly define
supp(e | D) = Σ_{S ∈ D} supp(e | S) as the support of an event in a database, and
supp(e | D) = Σ_{D ∈ D} supp(e | D) as the support of an event in all databases. A
pattern X is a sequence of k = |X| events e ∈ I. A pattern X occurs in a
sequence S if some subsequence of S is equal to X. Note that we allow gaps of
(consecutive) singleton events between the events of a pattern. We call a tuple
(i, j, X) a window for pattern X in a sequence. Indexes i and j specify the start
and end of the window. A window w is minimal if no other subsequence of w
contains X.</p>
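To make these notions concrete, the support counts and the minimal-window test can be sketched in a few lines of Python. This is our own illustration, not part of Sqs or SqsNorm; the helper names supp, occurs, and is_minimal_window are ours.

```python
def supp(e, database):
    """Support of event e in a database: total occurrences over all sequences."""
    return sum(s.count(e) for s in database)

def occurs(pattern, seq, i, j):
    """Does pattern occur, in order and with gaps allowed, within seq[i..j]?"""
    it = iter(seq[i:j + 1])
    return all(e in it for e in pattern)  # membership test consumes the iterator

def is_minimal_window(pattern, seq, i, j):
    """(i, j, X) is minimal if no proper subwindow also contains the pattern."""
    return (occurs(pattern, seq, i, j)
            and not occurs(pattern, seq, i + 1, j)
            and not occurs(pattern, seq, i, j - 1))

db = [list("abdcadbaabc")]
print(supp("a", db))                           # 4
print(is_minimal_window("abc", db[0], 0, 3))   # True: a b d c, with gap d
print(is_minimal_window("abc", db[0], 0, 10))  # False: a smaller window suffices
```

Note how the gapped-occurrence test simply consumes one shared iterator, which enforces the in-order, gaps-allowed semantics defined above.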
      </sec>
      <sec id="sec-2-2">
        <title>The MDL principle</title>
        <p>
          The Minimum Description Length (MDL) Principle is a guide to finding
structure in data. It was first introduced by Rissanen [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and attempts to achieve an
incarnation of the (incomputable) Kolmogorov complexity [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The basic idea is
that information corresponds to compression. The compressors used are called
models, and after choosing a model class M for a problem setting, the models
M 2 M can be compared with respect to their ability to compress. The better a
model compresses our data, the more information it incorporates. The principle
tells us that if we want to find the best model M for some data D, we should
pick the one that minimizes the sum
        </p>
<p>L(M) + L(D | M) .</p>
<p>The first term describes the number of bits used for the encoded model, while
L(D | M) represents the number of bits required to encode the data using the
information gained from M. Both terms should ideally keep each other in
balance, preventing overly complex but expensive models, as well as cheap but
uninformative ones. Note that we require the compression to be lossless in order
to fairly compare models of the same model class. Moreover, as in MDL we are
interested in theoretical complexity, we are only interested in (optimal)
code lengths, not actual instantiated code words.</p>
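As a toy illustration of this two-part trade-off (our own example, not taken from the original papers), consider choosing between a uniform code and a model that first spends a few bits on describing the empirical symbol distribution:

```python
from math import log2

def two_part_score(model_bits, data, code_len):
    """Two-part MDL score L(M) + L(D|M), with L(D|M) a sum of code lengths."""
    return model_bits + sum(code_len(x) for x in data)

data = "a" * 15 + "b"  # highly regular data
# Model 1: uniform code over {a, b}; nothing worth storing, 1 bit per event.
uniform = two_part_score(0, data, lambda x: 1.0)
# Model 2: spend (say) 4 bits describing the empirical distribution,
# then use Shannon-optimal code lengths -log2 p(x) for the data.
freq = {c: data.count(c) / len(data) for c in set(data)}
tuned = two_part_score(4, data, lambda x: -log2(freq[x]))
print(uniform)          # 16.0
print(tuned < uniform)  # True: the regularity pays for the extra model bits
```

On truly random data the tuned model would lose: its extra model bits buy no compression, which is exactly the balance the principle describes.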
      </sec>
      <sec id="sec-2-3">
        <title>Summarising a Single Database</title>
        <p>
          Tatti and Vreeken [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] recently proposed the Sqs algorithm (short for
Summarising event seQuenceS) to efficiently discover good summaries of a single event
sequence database, using the MDL principle to identify the optimal model. As we
will build upon and extend the Sqs algorithm, we will give a short introduction
to its main aspects below.
        </p>
        <p>The models Sqs uses are called code tables (CT ), which are lookup tables
between patterns, and their associated codes. To ensure lossless encoding of any
data over I, all singleton events are always included.</p>
        <p>
          The data is encoded as two separate streams Cp (pattern stream) and Cg (gap
stream). The first one contains the encoded data, represented as a sequence of
codes from the second column of the code table. The second stream is needed
since we have no information on potential gaps when reading a pattern code. It
is composed of fill-codes and gap-codes, and while decoding the data we read it
in parallel to the pattern stream to fill in the gaps. For more detailed information
about Sqs we refer to the original paper [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
As a running example, consider the data D = a b d c a d b a a b c and a code
table CT that, next to the singletons, contains the patterns abc, with code p,
and da, with code q. Sqs aligns D with three pattern windows, p, q, and p:
the first occurrence of abc has a gap filled by the singleton d, and the
occurrence of da has a gap filled by b. The pattern stream is hence
Cp = p d a q b p, while the gap stream Cg contains a gap code for each of these
two gaps, and fill codes otherwise.
Encoding the data For a pattern X in the code table, let us write usg(X) to
denote how often the code of this entry is used in the encoding. Intuitively, the
more often a code is used, the shorter it should be.
        </p>
        <p>
          Shannon entropy [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] tells us that the shortest prefix code of a pattern
corresponds to the negative logarithm of its probability, which in our case is its
relative occurrence. We define the length of a pattern-code as follows:</p>
        <p>L(code_p(X | CT)) = −log( usg(X) / Σ_{Y ∈ CT} usg(Y) ) .</p>
        <p>The length of the pattern stream Cp can now be defined intuitively. Since
the complete data has to be covered with codes of either patterns or
singleton events, the cost of a single code is added each time this code is used.
L(Cp | CT) = Σ_{X ∈ CT} usg(X) L(code_p(X)). The same strategy can be applied
for Cg — obviously, the gap code of a pattern should become shorter if gaps in
this pattern are likely.</p>
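Assuming a usage table like the one from the running example, these code lengths and the resulting cost of the pattern stream can be computed as in the following sketch (the function names are ours):

```python
from math import log2

def code_lengths(usage):
    """Shannon-optimal lengths: L(code_p(X | CT)) = -log2(usg(X) / sum usg(Y))."""
    total = sum(usage.values())
    return {x: -log2(u / total) for x, u in usage.items() if u > 0}

def pattern_stream_cost(usage):
    """L(Cp | CT): each use of a code contributes its full code length once."""
    lengths = code_lengths(usage)
    return sum(usage[x] * lengths[x] for x in lengths)

# Usages as in the running example: abc used twice, da once, plus singletons.
usage = {"abc": 2, "da": 1, "a": 1, "b": 1, "d": 1}
print(round(code_lengths(usage)["abc"], 3))  # 1.585, i.e. -log2(2/6)
print(round(pattern_stream_cost(usage), 2))  # 13.51
```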
        <p>The overall encoded length of a database D given a code table CT is then
L(D | CT) = LN(|D|) + Σ_{S ∈ D} LN(|S|) + L(Cp | CT) + L(Cg | CT) ,</p>
        <p>where we first encode the number of sequences in the database, as well as the
length of each sequence, using the MDL optimal universal code for integers
[
          <xref ref-type="bibr" rid="ref12">12</xref>
], which is defined for n &gt; 1 as LN(n) = log*(n) + log(c0), where log*(n) =
log(n) + log log(n) + . . . , summing only the positive terms. To make it a valid
encoding, i.e. to satisfy the Kraft inequality, we set c0 to 2.865064.
        </p>
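A direct transcription of this definition into Python might look as follows (the function name LN is ours):

```python
from math import log2

def LN(n):
    """Rissanen's MDL-optimal universal code length for an integer n >= 1:
    LN(n) = log*(n) + log2(c0), where log*(n) = log2(n) + log2(log2(n)) + ...
    sums only the positive terms, and c0 = 2.865064 satisfies Kraft."""
    bits, term = log2(2.865064), log2(n)
    while term > 0:
        bits, term = bits + term, log2(term)
    return bits

print(round(LN(1), 3))  # 1.519, just the log2(c0) constant
print(LN(4) > LN(2))    # True: larger integers cost more bits
```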
        <p>
          Finding the optimal alignment To encode a sequence database we need an
alignment of which codes are used when. In particular, we want that alignment
that encodes our data most succinctly. Whether a pattern should be used
depends on its code length, however, which in turn depends on how often we use
it. There are exponentially many alignments for a sequence database given a
set of patterns, and hence exhaustive search is not possible. Tatti and Vreeken
therefore propose a heuristic solution, which we call SqsAlign for the
remainder of this paper. SqsAlign exploits the fact that fixing an alignment leads to
computable code lengths and vice versa. In particular, given a set of patterns,
and initializing the usages to the supports of the patterns, SqsAlign alternates
between finding the most succinct alignment given the code lengths these usages
determine, and updating the code lengths accordingly, until convergence. As the
starting values for the usages are estimates, the algorithm will only find a local
minimum, yet experiments show that in practice this allows for the discovery
of good models. For more details about the Sqs aligning scheme we refer the
reader to Algorithm 2 in the original paper [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
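The alternating optimization can be illustrated with a deliberately simplified sketch. This is our own construction: a naive greedy left-to-right cover of a single gap-free sequence stands in for the actual windowed alignment search of SqsAlign, and Laplace smoothing stands in for the support-based usage initialization.

```python
from math import log2

def greedy_align(seq, patterns, lengths):
    """Greedily cover seq left to right, preferring a pattern over the
    singleton whenever its per-event code length is cheaper."""
    usage, i = {}, 0
    while i < len(seq):
        best = seq[i]  # fall back to the singleton at position i
        for p in patterns:
            if seq[i:i + len(p)] == p and lengths[p] / len(p) < lengths[best]:
                best = p
        usage[best] = usage.get(best, 0) + 1
        i += len(best)
    return usage

def sqs_align(seq, patterns, rounds=20):
    """Alternate between aligning (given code lengths) and updating the code
    lengths (given the usages) until the alignment no longer changes."""
    entries = patterns + sorted(set(seq))
    usage = {}
    for _ in range(rounds):
        total = sum(usage.values())
        # Laplace-smoothed Shannon code lengths from the current usages
        lengths = {e: -log2((usage.get(e, 0) + 1) / (total + len(entries)))
                   for e in entries}
        new_usage = greedy_align(seq, patterns, lengths)
        if new_usage == usage:
            break
        usage = new_usage
    return usage

print(sqs_align("abcxabc", ["abc"]))  # {'abc': 2, 'x': 1}
```

This mirrors the alternation only in spirit; the real SqsAlign initializes usages to pattern supports and searches over gapped minimal windows, which we omit here.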
        <p>Mining good models This leaves us with the question of how to discover
good models. Sqs does so in a bottom-up fashion, starting with the singleton
only code table ST, and then iteratively searching for that candidate pattern
X that maximally improves the compression, until no candidate improves the
score, at which point we halt. As candidates, it can either consider a
given pre-mined collection of serial episodes, or mine good candidates on the fly
by considering combinations PQ of patterns P, Q ∈ CT already in the code
table. Evaluating the quality of every candidate is expensive, and therefore Sqs
rather first estimates the quality of a candidate, and evaluates the candidates in
estimated order of gain in compression.</p>
<p>Limitations of Sqs for our purpose Although Sqs discovers good summaries
for single databases, it does not suffice to jointly summarize multiple databases;
when we run it on the concatenation of all databases, it will likely forego local
detail, whereas when we run it on each of the individual databases it will not
be able to pick up locally infrequent yet globally important patterns. Even
ignoring this, concatenating these models will be a bad summary of the whole:
as Sqs is a heuristic, two very similar databases might lead to
similar scores, yet very dissimilar pattern sets, and hence their concatenation
could produce yet another different result. This would take away our basis of
reasoning: instead of finding one set of patterns that describes well the entire set
of databases, we would end up with multiple (possibly unrelated) pattern sets
that each compress some subset of the data.</p>
        <p>Our approach is hence to instead find patterns from a unified set of
candidates, assuring consistent and comparable results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Theory</title>
      <p>Given the preliminaries above, we can now proceed to formalize our problem. To
this end we will first specify our model class M. A single code table, like for Sqs,
will not suffice in the setting of multiple databases. In our case, a model M ∈ M
will be a set S of pattern sets Sj, each Sj characterizing one of the database
subsets the user is interested in. We will use the original Sqs encoding to obtain
L(Di | CT i), which means that the receiver must be able to reconstruct code
tables CT1, . . . , CTd from model M to decode each database.</p>
      <p>Let D be a set of d sequence databases over alphabet I. Furthermore, let J
= {1, . . . , d} be a set of indexes. Any index set j ⊆ J can then be identified with
a set of databases {Di ∈ D | i ∈ j}. On receiving a set U of index sets as input,
our goal is to find a model M that compresses D. Specifically, for each of the
j ∈ U we want to find a set of patterns Sj that succinctly describes the subset of
databases j is associated with.</p>
      <sec id="sec-3-1">
        <title>Encoded length of the model</title>
        <p>We now proceed to describe all the information necessary to encode the pattern
sets Sj and the additional information needed to build a code table CTi for each
Di. To encode a pattern set Sj, we first encode the number of patterns it contains
using the universal code for integers introduced in the previous chapter. For each
pattern in Sj we encode its length as well as the events of the pattern. Patterns
are encoded like in Sqs, with the singleton only code table over the concatenated
databases, ST⊗, that assigns optimal Shannon codes to all singletons according
to their global support in D.</p>
<p>L(Sj) = LN(|Sj| + 1) + Σ_{X ∈ Sj} ( LN(|X|) + Σ_{x ∈ X} L(code_p(x | ST⊗)) ) .</p>
        <p>The final cost for all the pattern sets then amounts to the sum of the costs of
all the individual pattern sets.</p>
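Under the definitions above, L(Sj) might be computed as in the following sketch (the function names and the toy support table are ours; the global supports stand in for ST⊗ over the concatenated databases):

```python
from math import log2

def L_N(n):
    """Universal integer code from the preliminaries (c0 = 2.865064)."""
    bits, term = log2(2.865064), log2(n)
    while term > 0:
        bits, term = bits + term, log2(term)
    return bits

def L_pattern_set(Sj, global_support):
    """L(Sj): number of patterns, then per pattern its length |X| and its
    events, the latter with Shannon codes w.r.t. the global singleton supports."""
    total = sum(global_support.values())
    bits = L_N(len(Sj) + 1)
    for X in Sj:
        bits += L_N(len(X))
        bits += sum(-log2(global_support[x] / total) for x in X)
    return bits

support = {"a": 4, "b": 3, "c": 2, "d": 2}  # toy global supports over D
print(round(L_pattern_set(["abc", "da"], support), 2))  # 19.77
```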
        <p>With this information it is already possible to obtain the first column of the
code tables CT i. Let πi(S) = {Sj ∈ S | i ∈ j} denote all pattern sets that
contain relevant non-singleton patterns for a database Di, and Pi = ⋃ πi(S) the
set of patterns needed to encode Di. The entries of the first column of a code
table CT i then consist of the patterns in Pi, together with the singletons to
ensure lossless encoding.</p>
        <p>We still need to encode the information for the pattern, gap, and fill codes per
database to make this scheme work. Let usg(Pi) denote the sum of all usages
of patterns in Pi. The length of the code columns Ci for code table CT i for
database Di is then defined as</p>
<p>L(Ci) = LN(||Di||) + log binom(||Di|| + |I| − 1, |I| − 1)
+ LN(usg(Pi) + 1) + log binom(usg(Pi) + |Pi| − 1, |Pi| − 1)
+ Σ_{X ∈ Pi} LN(gaps(X) + 1) ,</p>
        <p>where the first two terms encode the support of each singleton in a database.
This is done using a weak number composition. The second part uses the same
technique to encode the usages for each pattern. Lastly, we encode the number
of gaps for each pattern to obtain the gap codes. The fill code of a pattern can
be computed knowing its length and usage, which we do at this point. Having
defined the encoding for both the pattern sets and the code tables, the final cost
for our model amounts to L(M) = Σ_{j ∈ U} L(Sj) + Σ_i L(Ci).</p>
        <p>For some database Di, it is now trivially possible to extract the corresponding
code table CTi from our model. For the leftmost column, we add the singletons
as well as the patterns in Pi which we know from S. Lastly, we fill in the codes
with the information Ci provides.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Encoded length of the data</title>
        <p>Given a code table CT i for a database Di, we can simply re-use the encoding
scheme from Sqs. The pattern codes codep(X | CT i) and gap codes codegap(X | CT i)
and codefill (X | CT i) for a database Di correspond to the respective pattern X
in CT i. Each database is encoded as two streams Cp and Cg, with
L(Cp | CT i) = Σ_{X ∈ CT i} usg(X) L(code_p(X | CT i)) , and
L(Cg | CT i) = Σ_{X ∈ CT i, |X| &gt; 1} gaps(X) L(code_gap(X | CT i)) + fills(X) L(code_fill(X | CT i)) .</p>
        <p>The cost for a database Di then amounts to
L(Di | CT i) = LN(|Di|) + Σ_{S ∈ Di} LN(|S|) + L(Cp | CT i) + L(Cg | CT i) .</p>
        <p>L(D) is then obtained by adding up the codes for each Di ∈ D. This score
enables us to evaluate how well different pattern sets compress our data, and
whether a candidate should be added to the model.</p>
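Combining the pieces, the cost of a single database could be sketched as follows. This is our own simplification: the gap stream is stubbed out with a flat per-gap cost instead of the actual code_gap and code_fill codes.

```python
from math import log2

def L_N(n):
    """Universal integer code from the preliminaries (c0 = 2.865064)."""
    bits, term = log2(2.865064), log2(n)
    while term > 0:
        bits, term = bits + term, log2(term)
    return bits

def L_database(seq_lengths, usage, gaps=None, gap_bits=1.0):
    """L(Di | CTi): number and lengths of the sequences via L_N, plus the
    pattern stream under Shannon-optimal codes; the gap stream is stubbed
    with gap_bits per gap (a simplification of code_gap/code_fill)."""
    gaps = gaps or {}
    total = sum(usage.values())
    bits = L_N(len(seq_lengths)) + sum(L_N(n) for n in seq_lengths)
    bits += sum(u * -log2(u / total) for u in usage.values())  # L(Cp | CTi)
    bits += sum(g * gap_bits for g in gaps.values())           # L(Cg | CTi), stub
    return bits

usage = {"abc": 2, "da": 1, "a": 1, "b": 1, "d": 1}
print(round(L_database([11], usage, gaps={"abc": 1, "da": 1}), 2))
```

With a real gap encoding, the two gap positions of the running example would be priced by their own Shannon codes rather than a flat bit each.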
      </sec>
      <sec id="sec-3-3">
        <title>Problem definition</title>
<p>Having formalized our score, we can now formally state the problem we consider.
Minimal Serial Episode Composition Problem Let I be a finite alphabet
of events, D a bag of sequence databases over I, and U a set of index sets over
D. Further, let F be the set of all possible sequences over I of length at most
max_{S ∈ D} |S|. Find that set S ∈ P(F)^{|U|} of sets of serial episodes that minimizes</p>
        <p>L(S) + L(D | S) .</p>
        <p>
          Regarding the complexity of this problem, we note that summarising a single
sequence database is already hard: the search space is exponential, and there
is no easily exploitable structure [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Unfortunately, our problem setting does
not make it easier. We now have to consider all possible episodes in D, and
all possible ways to divide them up into |U | many pattern sets. Exhaustively
searching for the optimal model is not feasible, and there is again no exploitable
structure in the search space. We therefore propose an algorithm that finds good
pattern sets on a heuristic basis.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Algorithm</title>
      <p>In the previous section we introduced MDL scores for model and data and defined
the problem we are facing. Having seen that an exact computation of the optimal
model is not feasible, we propose SqsNorm, a heuristic solution that allows us
to mine good models directly from data.</p>
<p>Estimating candidates We denote the number of bits needed to encode
database Di with code table CT i as L(Di, CT i). Let us write ΔL(Di, CT i ⊕ X)
for the gain of adding a pattern X to the code table of a database Di,</p>
      <p>ΔL(Di, CT i ⊕ X) = L(Di, CT i) − L(Di, CT i ⊕ X) .</p>
      <p>To compute this difference we use the Sqs encoding and compare the scores. We
have seen in the previous sections that Sqs estimates the gain of a candidate
and chooses the best extension PX for a pattern P based on this estimate.</p>
<p>We propose a slightly altered version of the estimate algorithm used in Sqs.
Instead of returning only the extension with the highest gain dX, it returns the set
of all possible extensions for P together with their scores. Its outline is shown in
Algorithm 1. This tweak allows us to compare scores of candidates over several
databases.</p>
      <p>Algorithm 1: Estimates(Di, Ai, P)
input : database Di, current alignment Ai, pattern P ∈ CT i
output: all possible candidates PX and their estimated gain dX
1 for windows v of P in Ai do
2     for windows w of X that occur after v do
3         dX ← updateGain(P, X);
4 return {(X, dX) | X ∈ CT i};</p>
      <p>Let us denote the estimated gain achieved by adding a pattern P
somewhere in our model as ΔL̂(D, S ⊕ P). Our goal is then to find candidates
with a high ΔL̂(D, S ⊕ P).</p>
<p>For any candidate pattern PX where P and X are drawn from P ∪ I we
define a function ΔL̂(Di, CT i ⊕ PX) to describe the estimated gain when PX
is added to the code table CT i for Di. This is not equivalent to dX, because dX
is only defined for entries P and X that have at least one active window in Di.
We set this new score to zero for patterns that either are not relevant for the
given database, or are expected to have a negative gain:</p>
<p>ΔL̂(Di, CT i ⊕ PX) = dX if (X, dX) ∈ Estimates(Di, Ai, P) and dX &gt; 0, and 0 otherwise.</p>
      <p>Algorithm 2 gives an outline of how to find a good extension X for a given
pattern P in the multiple database setting.</p>
      <sec id="sec-4-1">
        <title>The SqsNorm Algorithm</title>
        <p>For each extension X ∈ P, Algorithm 2 computes gainX = Σ_i ΔL̂(Di, CT i ⊕ PX), and returns the candidate PX with the highest gainX .</p>
        <p>For simplicity, we assume that a pattern will be added to all databases where
it has a positive gain dX , even though an index set representing these databases
might not even exist in U . The task of choosing the right Sj for a pattern is
addressed after the estimating process.</p>
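The selection of the best extension over multiple databases can be sketched as follows (an illustration of the idea behind Algorithm 2; best_extension and the toy estimates are ours, and each entry of estimates_per_db plays the role of one Estimates(Di, Ai, P) result):

```python
def best_extension(P, estimates_per_db):
    """Pick the extension X with the highest summed estimated gain over all
    databases, counting irrelevant or negative estimates as zero."""
    gain = {}
    for estimates in estimates_per_db:      # one dict {X: dX} per database
        for X, dX in estimates.items():
            gain[X] = gain.get(X, 0.0) + max(0.0, dX)
    X = max(gain, key=gain.get)
    return P + X, gain[X]

# Toy estimates for extending P = "big bad" over three databases:
ests = [{" wolf": 12.0, " dog": 3.0}, {" wolf": -2.0}, {" wolf": 7.5, " dog": 1.0}]
print(best_extension("big bad", ests))  # ('big bad wolf', 19.5)
```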
<p>Since we can now create a list of candidates for our
model, we proceed by computing their actual value. Since this requires us to call
SqsAlign for each candidate and database, we only want to do an expensive
computation if it is possible for the candidate to exist. Consider the following
indicator function:</p>
        <p>δ(Di, P) = 1 if there exist P1, P2 ∈ CTi with P = P1P2, and 0 otherwise.</p>
        <p>We give the pseudo-code of SqsNorm as Algorithm 3. We first get the
preliminary alignments by using SqsAlign without any patterns (line 1), and then
start by creating a batch of candidates (3) that we order by their estimated gain.
For each database in which a candidate may occur (5–6), we compute the code
lengths to obtain their actual gain (7). The SetCover section is inherited from
DiffNorm [<xref ref-type="bibr" rid="ref2">2</xref>]: the set w contains the gain of a candidate PX for each index
set j as a weight (10), and we choose to add PX to those Sj that maximize
the gain (11–12). Since adding a new pattern to a code table might change the
usages of other patterns, or even make them redundant (imagine big bad
becoming dispensable as soon as big bad wolf has been found), each candidate check is
followed by a pruning step (13), in which we re-evaluate all patterns in the model
one by one, permanently removing them if this gives a gain in compression.</p>
        <p>Algorithm 3: SqsNorm(D, U)
input : bag of d databases D, set of index sets U
output: model S = {S1, . . . , S|U|} that compresses D well
1  S ← {∅ | j ∈ U}; A ← {Sqs(Di, ∅) | Di ∈ D}
2  do
3      F ← {GetCandidate(D, A, P) | P ∈ P, e ∈ I}
4      for PX ∈ F ordered by gainX do
5          for Di ∈ D do
6              if δ(Di, PX) then
7                  gaini ← ΔL(Di, Ci ⊕ PX)
10         w ← {max(0, Σ_{i ∈ j} gaini) | j ∈ U}
11         U′ ← WeightedSetCover(J, U, w)
12         S′ ← S with PX added to every Sj with j ∈ U′ and positive wj
13         S ← Prune(D, S, S′)</p>
        <p>Complexity Last, we consider the computational complexity of SqsNorm.
Since Estimate takes O(|Pi| + |I| + ||Di||) steps [<xref ref-type="bibr" rid="ref15">15</xref>], GetCandidate has
a complexity of O(d × (|Pi| + |I| + ||Di||)). The overall number of patterns
S in our model can be bounded by the number F of frequent serial episodes
over I. Running Sqs on one database Di takes O(|Pi| × ||Di||) [<xref ref-type="bibr" rid="ref15">15</xref>], and pruning takes
O(|S|² × |D|) time [<xref ref-type="bibr" rid="ref2">2</xref>]. All in all, SqsNorm executes GetCandidate, SqsAlign
and Prune at most |F| times, which results in a worst case complexity of
O(|F| × ((|F| × ||D||) + (|F| + ||D|| + d × |I|) + (|F|² × |D|))) &lt; O(|F|³ × ||D||).
In practice, SqsNorm converges quickly, only evaluating few carefully chosen
candidates, calling SqsAlign only where it makes sense to do so, and keeping
only relevant patterns in the model at all times; in our experiments SqsNorm
takes up to a few minutes on reasonably sized data.</p>
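The weight computation of the set-cover step (line 10 of Algorithm 3, inherited from DiffNorm) might be sketched as follows; the function name and the toy values are ours:

```python
def pattern_set_weights(gains, index_sets):
    """For every index set j in U, the candidate's weight is
    w_j = max(0, sum of its per-database gains over j)."""
    return {j: max(0.0, sum(gains.get(i, 0.0) for i in j)) for j in index_sets}

# Per-database gains of one candidate; U holds singleton sets plus a global set.
gains = {0: 4.0, 1: 3.5, 2: -1.0}
U = [frozenset({0}), frozenset({1}), frozenset({2}), frozenset({0, 1, 2})]
w = pattern_set_weights(gains, U)
print(w[frozenset({0, 1, 2})])  # 6.5: the loss in database 2 is outweighed
print(w[frozenset({2})])        # 0.0: weights are never negative
```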
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        Discovering interesting patterns from sequential data has a long history.
Traditional approaches focus on mining all patterns that satisfy a local
interestingness constraint. In sequential data, however, there exist multiple reasonable
definitions for support. Laxman et al. proposed to count the largest set of
non-overlapping windows using finite state automata, tracking the frequencies of
each episode with an automaton [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Their method discovers serial and parallel
episodes. Mannila et al. propose counting minimal windows, starting with
singletons and building episodes of increasing length [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The algorithm relies on
the fact that a frequent episode implies that its sub-episodes are frequent and
requires window size and minimal support as user parameters. Tatti et al.
introduce strict episodes and define a subset relation between them, using directed
acyclic graphs to mine pattern candidates [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. This approach also generates
candidate episodes from already discovered patterns.
      </p>
      <p>
        A related class of methods aims to discover those patterns that are prevalent
in one class, and rare in the other [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or to mine such discriminative patterns
directly from data for classification purposes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Quiniou et al. consider multiple
text corpora with the goal of discovering typical linguistic patterns for each of
them [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Our objective, on the other hand, is not to discriminate between
databases nor to describe them independently, but to jointly characterize sets of
databases the user is interested in.
      </p>
      <p>
        In this paper we do not consider mining all patterns, but rather return a
small set of patterns that together describe the given databases well. As far as
pattern set mining is concerned, Krimp [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] made the first step toward using
MDL in market basket analysis, selecting a small set of patterns that compress
a single database well out of a possibly huge set of frequent patterns. Slim [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
later refined this approach, mining candidates directly from the data instead of
depending on an input set of candidates.
      </p>
      <p>
        In this paper we build upon Sqs [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], an MDL-based heuristic to mine
descriptive sets of serial episodes for a single sequence database. To encode the
data, it uses look-up tables for patterns and codes, and considers candidates
either from a pre-mined pool, or by iteratively considering combinations of
patterns in the current model. While this approach works well for a single dataset,
it cannot trivially be extended for multiple databases, since the heuristic nature
of the algorithm would not produce consistent candidates for each database.
      </p>
      <p>
        In a more recent approach, Squish enriches the concept of Sqs with a broader
pattern language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition to serial episodes, Squish allows for choice
episodes, where a position in an episode may be filled with one of a predefined
choice of events. In addition, Squish adopts a greedy heuristic to find good
covers of the data. It will be interesting to see to what extent these ideas can be
carried over into SqsNorm.
      </p>
      <p>
        DiffNorm [
        <xref ref-type="bibr" rid="ref2">2</xref>
] is the other inspiration of this work. Its goal is to find the
difference and norm among itemset databases with the help of MDL, which we here
extend to the setting of sequential data; unlike SqsNorm, DiffNorm does not
consider order, repetitions and possible gaps, and as a result neither its score
nor its search algorithm can be trivially applied or adapted to sequential data.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>Next we proceed to empirically evaluate SqsNorm. To this end we consider
both synthetic and real world data. As real world data, for interpretability of
the results, we will consider text data. All experiments were run single-threaded
on a quad-core Intel machine with 8GB of memory, running Windows.</p>
      <sec id="sec-6-1">
        <title>Synthetic data</title>
<p>All synthetic datasets feature 10,000 events per database. Events are drawn
from alphabet I = {0, . . . , 500}.</p>
      </sec>
      <sec id="sec-6-2">
<title>Results</title>
<p>Table 1: Results on synthetic datasets. From left to right: number of locally (# L)
and globally (# G) planted patterns, length using the empty model (L(D, S0)),
compression rate (L% = L(D, S) / L(D, S0)), total number of patterns (|S|), exactly
recovered patterns assigned to the correct Sj (=), concatenations of patterns (∪),
patterns that are unrelated to any planted patterns (?), and runtime in seconds.</p>
        <p>Dataset  # L  # G  L(D, S0)  L%  |S|  =  ∪  ?  time(s)
Random    0   0  461 180  100    0    0  0  0   34
Local    50   0  460 612   97   50   50  0  0  161
Global    0  10  460 540   97   10   10  0  0  111
Mixture  10  90  906 089   71  100  100  0  0  765</p>
        <p>time(s)
=
each individual database as well as a global pattern set S⌦ . In the following we
describe four basic datasets we used to ascertain the most important properties
of SqsNorm.</p>
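        <p>For concreteness, the compression rate reported in Table 1 can be computed as follows (a minimal sketch; the function name is ours, not from the paper):</p>
        <p>
```python
def compression_rate(l_model_bits, l_empty_bits):
    """Relative compression L% = 100 * L(D, S) / L(D, S0).

    Lower values mean the discovered pattern set S describes the data
    more succinctly than the empty, singleton-only model S0."""
    return 100.0 * l_model_bits / l_empty_bits
```
        </p>
        <p>For the Random dataset no patterns are discovered, so L(D, S) = L(D, S0) and the rate stays at 100%.</p>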
        <p>Random To make sure our algorithm does not pick up noise, we create the
Random dataset, consisting of five databases. It contains no information, since
events are picked uniformly at random.</p>
        <p>Local For this dataset, we set up five databases and plant ten unique patterns
per database (all in all 50 unique patterns: (1, . . . , 5) to (246, . . . , 250)). In this
case we would like to find all the planted patterns in their respective Sj, while
SΩ has to stay empty.</p>
        <p>Global We then proceed by planting 10 global patterns in five databases.
Patterns consist of randomly chosen events from I, which allows for arbitrary
patterns that might have repetitions or shared events. This time, the patterns should
be identified as typical for all databases and put in SΩ.</p>
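        <p>As a rough illustration of how such synthetic databases can be generated, consider the following sketch (the helper names are ours; the actual generator also allows planted patterns to be interleaved with gaps):</p>
        <p>
```python
import random

def make_database(n_events, alphabet, patterns, n_plants, seed=None):
    """One synthetic event sequence: uniform random noise into which
    each pattern is written n_plants times at random positions.
    Simplified: patterns are planted contiguously, without gaps."""
    rng = random.Random(seed)
    seq = [rng.choice(alphabet) for _ in range(n_events)]
    for pat in patterns:
        for _ in range(n_plants):
            pos = rng.randrange(n_events - len(pat) + 1)
            seq[pos:pos + len(pat)] = pat  # overwrite len(pat) slots
    return seq

def contains(seq, pat):
    """True iff pat occurs as a contiguous subsequence of seq."""
    return any(seq[i:i + len(pat)] == pat
               for i in range(len(seq) - len(pat) + 1))
```
        </p>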
        <p>Mixture In the Mixture set we run SqsNorm on a set of 10 databases and
plant a total of 100 patterns, 90 of which are global, while each database retains
exactly one local pattern. Patterns are created exactly as for the Global set.</p>
        <p>We now proceed to test SqsNorm on some real world datasets. We first explain
the data structure and the preprocessing methods, and then present some results.</p>
        <p>Addresses per Era First, we regard the Addresses dataset. It consists of the
presidential inaugural addresses from George Washington to Donald Trump.
The individual words are stemmed and stopwords are removed. Each word in
the resulting corpus is seen as an event, and we group the addresses into eight
eras of American history, regarding each of these eras as a database and the
individual addresses as sequences. Figure 2 shows some results for each era and
SΩ.</p>
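        <p>A simplified version of this preprocessing step could look as follows (the stopword list and the crude suffix-stripping stemmer are stand-ins; a real pipeline would use a full stopword list and, e.g., a Porter stemmer):</p>
        <p>
```python
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "we", "our"}

def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

def to_events(text):
    """Turn raw text into a sequence of events: lowercase words,
    with stopwords removed and the remaining words stemmed."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]
```
        </p>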
        <p>Addresses individual For comparison, we then treat each inaugural address
as a database and run SqsNorm again. We would expect SΩ to shrink, since
57 individual inaugural addresses probably have fewer patterns in common than
entire eras from the previous experiment. It would also be natural for the
individual Sj to become smaller, since a single address is not likely to have more than
two or three recurring slogans. The experiments confirm this effect. SΩ shrinks,
and the average number of patterns per individual address goes down to about
1. This often corresponds to one special catchphrase or slogan used during the
address. Interestingly, these slogans were often not found in the Era dataset,
because it is too costly to call such a pattern era-typical when it only occurs
during one speech. Some patterns that were only found on an address-wise run
of SqsNorm are deliver up (Lincoln), new spirit (Carter), put out [my] hand
(Bush) and make America [. . . ] again (Trump).</p>
        <p>LOTR1-12 To obtain more meaningful results, we analyze the first twelve
chapters of J.R.R. Tolkien’s Lord of the Rings: The Fellowship of the Ring. The text is
stemmed and stopwords are removed. We regard individual chapters as sequence
databases containing a single sequence. Quite correctly, SqsNorm finds general
themes like “Mr Bilbo” or “Sam Gamgee” in SΩ. Characteristics of each chapter
include chapter-bound events like Bilbo’s eleventy-first birthday, and new
concepts that are introduced and explained during a chapter such as the one ring.
Figure 3 shows patterns found for selected chapters.</p>
        <p>The Hobbit We then proceed to explore J.R.R Tolkien’s book The Hobbit in
the same way. Again, the patterns found conform to our expectations. Local
patterns correspond to specific scenes or places (dining room, birthday present,
lake town), while the main characters’ names and concepts that weave through
the entire story (King under the Mountain, Mr Baggins, Fili Kili,. . . ) are found
in SΩ.</p>
        <p>[Figure 2: characteristic per-era patterns from the Addresses dataset, including “Inland Frontier”, “Foreign Nation”, “Interstate Commerce”, “Race Feel”, “Nuclear Weapon”, “Both Sides”, “today”, “1789”, “Civil Service”, “Social Order”, “Eighteenth Amendment”, “Western Hemisphere”, “21st Century”, “Health Care”, and “fellow citizen”.]</p>
        <p>
Songs To evaluate SqsNorm on a somewhat different scale, we finally test it
on a collection of four songs taken from the top of the charts around Christmas.
Each song is understood as a database containing a single sequence. Instead of
comparing them word by word, we choose to analyze them letter by letter, with
I = {a, . . . , z} as our alphabet. On the one hand, this results in global stopwords,
like “and” and “this”, which do not contribute much interesting information. We find
one non-stopword pattern in SΩ: “love” is common to all chosen songs.
On the other hand we find long pieces of refrains (up to 32 letters) in the local
pattern sets, often catching the most memorable bits of the song.
</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>The experiments show that SqsNorm correctly returns good sets of informative
patterns. By analyzing synthetic datasets we could show that true patterns are
correctly found up to very high gap levels. Furthermore, global patterns are
correctly identified as such while noise is not picked up at all. Results on real
datasets are interesting and plausible, and we discover universally used phrases
and themes alike. By experimenting with the Songs dataset we have seen that
SqsNorm can also be applied on a character-wise level. It picks up large parts of
the songs’ main refrains, as well as frequently used stopwords and even identifies
love as a common theme they share.</p>
      <p>[Figure 3: patterns found for selected chapters of LOTR1-12. Chapter-specific patterns (CH1, CH2, CH4, CH7, CH11, CH12) include “good bye”, “one hundred forty four”, “chubb bolger bracegirdl brockhous proudfoot”, “young hobbit”, “eleventy one”, “land [of] mordor”, “gladden field”, “one ring”, “black rider”, “mr peregrin”, “golden perch”, “short cut”, “tom bombadil”, “ring ding dillo”, “nightly noise”, “hill top”, “tin uviel”, “gil galad”, “answer[ed] strider”, and “house elrond”; patterns in SΩ include “mr baggins”, “long ago”, “brandy hall”, “mr bilbo”, “sam gamgee”, and “bag end”.]</p>
      <p>
        Despite these very encouraging results, there are several possibilities to
further improve SqsNorm. Candidate estimation can be trivially parallelized by
concurrently evaluating candidates. Moreover, since estimates for each database
are calculated independently, all of these computations can be parallelized on a
database-level as well. Lastly, we are currently using prefix codes for our
encoding. There is a more refined approach called prequential coding, which allows
us to transmit the data without having to encode the usages of the patterns
explicitly, and hence helps avoid any possible bias in the encoding of the model.
The key idea of prequential coding is that, starting from a uniform distribution
over the usages, we can update these counts with every code that we receive, and
so rapidly approximate the true usage distribution. This has been successfully
used in DiffNorm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and while our results show it is not necessary for obtaining
meaningful results, it will be interesting to see whether they improve when we adopt it.
      </p>
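      <p>The update scheme can be sketched as follows (a minimal illustration of prequential coding, not the encoding used by SqsNorm itself):</p>
      <p>
```python
import math

def prequential_bits(symbols, alphabet, epsilon=0.5):
    """Code length in bits of transmitting `symbols` prequentially:
    counts start uniform (pseudo-count epsilon per symbol) and are
    updated after every transmitted code, so sender and receiver
    track the same, increasingly accurate usage distribution."""
    counts = {a: epsilon for a in alphabet}
    total = epsilon * len(alphabet)
    bits = 0.0
    for s in symbols:
        bits += -math.log2(counts[s] / total)  # cost under current counts
        counts[s] += 1.0
        total += 1.0
    return bits
```
      </p>
      <p>For a heavily skewed usage distribution the adaptive code quickly approaches the optimal length without transmitting any counts up front: one hundred repetitions of the same symbol over a two-symbol alphabet cost only a few bits, versus 100 bits under a fixed uniform code.</p>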
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>In this paper we studied the problem of discovering the difference and the norm between
sequence databases. In order to do so, we proposed SqsNorm, an algorithm that
uses MDL to find sets of patterns that represent both local and global features of
the data. Drawing inspiration from Sqs and DiffNorm, we defined an encoding
for our data and proposed a way to mine good candidates directly from data.
Evaluating SqsNorm, we have seen that our system works well in practice, does
not pick up noise, and is fast. SqsNorm makes it possible to extract potentially
interesting information about what is the norm and what are the differences
between sequence databases in general, and, as we have shown, text data in
particular. In the future we plan to further explore its application for exploratory
analysis in computational linguistics.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>The authors are supported by the Cluster of Excellence “Multimodal Computing
and Interaction” within the Excellence Initiative of the German Federal
Government.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          .
          <article-title>Efficiently summarising event sequences with rich interleaving patterns</article-title>
          .
          <source>In SDM. SIAM</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Budhathoki</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          .
          <article-title>The difference and the norm - characterising similarities and differences between databases</article-title>
          .
          <source>In ECML PKDD</source>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. H. Cheng,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          , J. Han, and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Direct discriminative pattern mining for effective classification</article-title>
          .
          <source>In ICDE</source>
          , pages
          <fpage>169</fpage>
          -
          <lpage>178</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Cover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Thomas</surname>
          </string-name>
          . Elements of Information Theory. Wiley-Interscience, New York,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Fowkes</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          .
          <article-title>A subsequence interleaving model for sequential pattern mining</article-title>
          .
          <source>In KDD</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>X.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Dong</surname>
          </string-name>
          .
          <article-title>Mining minimal distinguishing subsequence patterns with gap constraints</article-title>
          .
          <source>Knowl. Inf. Syst.</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>259</fpage>
          -
          <lpage>286</lpage>
          ,
          <year>April 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Laxman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Sastry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Unnikrishnan</surname>
          </string-name>
          .
          <article-title>A fast algorithm for finding frequent episodes in event streams</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>410</fpage>
          -
          <lpage>419</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vitányi</surname>
          </string-name>
          .
          <source>An Introduction to Kolmogorov Complexity and its Applications</source>
          . Springer,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>H.</given-names>
            <surname>Mannila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Toivonen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. Inkeri</given-names>
            <surname>Verkamo</surname>
          </string-name>
          .
          <article-title>Discovery of frequent episodes in event sequences</article-title>
          .
          <source>Data Min. Knowl. Discov.</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ):
          <fpage>259</fpage>
          -
          <lpage>289</lpage>
          ,
          <year>January 1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. S. Quiniou,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cellier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Charnois</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Legallois</surname>
          </string-name>
          .
          <article-title>What about sequential data mining techniques to identify linguistic patterns for stylistics?</article-title>
          <source>In CICLing</source>
          , pages
          <fpage>166</fpage>
          -
          <lpage>177</lpage>
          . Springer-Verlag,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>J.</given-names>
            <surname>Rissanen</surname>
          </string-name>
          .
          <article-title>Modeling by shortest data description</article-title>
          .
          <source>Automatica</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>465</fpage>
          -
          <lpage>471</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Rissanen</surname>
          </string-name>
          .
          <article-title>A universal prior for integers and estimation by minimum description length</article-title>
          .
          <source>Annals Stat.</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ):
          <fpage>416</fpage>
          -
          <lpage>431</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>K.</given-names>
            <surname>Smets</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          .
          <article-title>Slim: Directly mining descriptive patterns</article-title>
          .
          <source>In SDM</source>
          , pages
          <fpage>236</fpage>
          -
          <lpage>247</lpage>
          . SIAM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>N.</given-names>
            <surname>Tatti</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Cule</surname>
          </string-name>
          .
          <article-title>Mining closed episodes with simultaneous events</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>1172</fpage>
          -
          <lpage>1180</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>N.</given-names>
            <surname>Tatti</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          .
          <article-title>The long and the short of it: Summarizing event sequences with serial episodes</article-title>
          .
          <source>In KDD</source>
          , pages
          <fpage>462</fpage>
          -
          <lpage>470</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Vreeken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Leeuwen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Siebes</surname>
          </string-name>
          . Krimp:
          <article-title>Mining itemsets that compress</article-title>
          .
          <source>Data Min. Knowl. Disc.</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>169</fpage>
          -
          <lpage>214</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>