Mining Sequences in Phone Recordings with Answer Set Programming

Francesca A. Lisi1,*, Gioacchino Sterlicchio2

1 Dipartimento di Informatica, University of Bari “Aldo Moro”, Via E. Orabona 4, Bari, 70125, Italy
2 Dept. of Mechanical Engineering, Polytechnic University of Bari, Via G. Amendola 126/b, 70126 Bari, Italy

Abstract
The analysis of phone recordings is one of the activities typically performed in Digital Forensics practice. It provides information, such as the geographical position of a suspect, that is useful to reconstruct the network of contacts. In this work we consider the problem of analyzing a dataset of anonymized phone recordings, made available within the DigForASP project, by looking for sequential patterns in the recordings. The approach leverages the expressive and inferential power of the Answer Set Programming paradigm. The experiments reported here concern the extraction of closed and maximal patterns (condensed representations), which significantly reduces the number of returned patterns without loss of information.

Keywords
Sequential Pattern Mining, Answer Set Programming, Digital Forensics.

1. Introduction

In the context of the forensic practice, material found in digital devices, often in relation to mobile devices and computer crime, is typically a subject of investigation. The identification, acquisition, preservation, analysis and presentation of this material, by means of specialized software and according to specific regulations, are indeed the phases of the procedure usually followed by professionals of Digital Forensics (DF). In particular, the phase of so-called Evidence Analysis involves the examination and aggregation of evidence about possible crimes and crime perpetrators collected from various electronic devices, in order to reconstruct events, event sequences and scenarios related to a crime.
HYDRA - RCRA 2022: 1st International Workshop on HYbrid Models for Coupling Deductive and Inductive ReAsoning and 29th RCRA workshop on Experimental evaluation of algorithms for solving problems with combinatorial explosion
* Corresponding author, affiliated also to the Centro Interdipartimentale di Logica e Applicazioni (CILA) of the University of Bari.
FrancescaAlessandra.Lisi@uniba.it (F. A. Lisi); ninnisterlicchio@gmail.com (G. Sterlicchio)
http://www.di.uniba.it/~lisi/ (F. A. Lisi)
0000-0001-5414-5844 (F. A. Lisi); 0000-0002-2936-0777 (G. Sterlicchio)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

The results from this phase are then communicated to the other actors in the forensic context, such as law enforcement bodies, lawyers and judges. Recently, it has been suggested that the phase of Evidence Analysis has particular requirements that make the use of logic-based AI techniques, especially from the areas of Knowledge Representation and Automated Reasoning, a much more promising approach, with the potential of becoming a breakthrough in the state of the art of research in the field and of introducing disruptive innovations in the forensic practice [1, 2]. The ultimate goal of Evidence Analysis is indeed the formulation of verifiable evidence that can be rationally presented in a trial. Under this perspective, the results provided by, e.g., ML classifiers or other types of “black box” AI systems can be regarded as having the same value as a human witness’s suspicions. Therefore, they could not be used as legal evidence. Conversely, logic-based AI methods provide a broad range of proof-based reasoning functionalities that can be implemented in a declarative framework where the problem specification and the computational program are closely aligned.
This has the benefit that the correctness of the resulting systems can be formally verified. Moreover, new methods for visualising and explaining the computed answers (e.g., based on graphical languages) have recently been proposed. So one can not only model and solve relevant problems, but also effectively communicate the conclusions (and their proofs) in a transparent, comprehensible and justified way. This approach is the one adopted by the COST Action “Digital forensics: evidence analysis via intelligent systems and practices” (DigForASP)1, a large international cooperation network which aims at promoting formal and verifiable AI methods and techniques for Evidence Analysis [3].

A typical DF problem is the analysis of phone recordings. Indeed, during the investigation of a crime, it is common to analyze the communications of a particular suspect. Since nowadays mobile phones are owned by virtually everyone, it can be useful for investigators to analyze the calls or messages exchanged. The telephone records contain all the traces of communication (calls, SMS, and all the data traffic) concerning a specific user over a certain period of time. Note that phone records do not trace sensitive data such as the audio of calls sent or received: they only provide a trace of the communication that has taken place, not its content. The phone records can be requested by the Judicial Authority if deemed useful in order to carry out investigations involving the individual owner of the phone. Correctly analyzing the telephone records is essential to obtain useful hints. Depending on the analysis, different types of information can be extracted. The records are typically analyzed for comparing the suspect's geographical positions against his/her declarations, and for reconstructing the network of contacts of a single user in order to trace which conversations (s)he has had, with whom, where and when.
In this paper we consider the problem of mining sequences in phone recordings in order to discover frequent sequential patterns. In particular, we propose a declarative approach based on Answer Set Programming (ASP) which adapts previous work in sequential pattern mining [4, 5] to the problem at hand. The proposed approach is in the spirit of DigForASP and appears to be a promising support to the analysis of events and sequences of events in scenarios of interest to DF experts. This paper extends the preliminary work reported in [6] with a significant refinement aimed at addressing some limits of pattern mining, such as the size of the output. In particular, here we explore the use of condensed representations for sequential patterns.

The paper is organized as follows. In Section 2 we provide the necessary preliminaries on ASP and sequential pattern mining. In Section 3 we describe the specific case study for the analysis of mobile phone recordings. In Section 4 we report the details of our ASP-based approach, including the ASP encodings, and the experimental results obtained by comparing the behaviour of the encodings. In Section 5 we conclude by commenting on the ongoing work and by outlining some promising directions for research.

1 https://digforasp.uca.es/

2. Preliminaries

2.1. Answer Set Programming

In the following we give a brief overview of the syntax and semantics of disjunctive logic programs in ASP. The reader can refer to, e.g., [7] for a more extensive introduction to ASP. Let 𝑈 be a fixed countable set of (domain) elements, also called constants, upon which a total order ≺ is defined. An atom 𝛼 is an expression 𝑝(𝑡1, . . . , 𝑡𝑛), where 𝑝 is a predicate of arity 𝑛 ≥ 0 and each 𝑡𝑖 is either a variable or an element from 𝑈 (i.e., the resulting language is function-free). An atom is ground if it is free of variables. We denote the set of all ground atoms over 𝑈 by 𝐵𝑈. A (disjunctive) rule 𝑟 is of the form 𝑎1 ∨ . . . ∨ 𝑎𝑛 ← 𝑏1, . . . , 𝑏𝑘, 𝑛𝑜𝑡 𝑏𝑘+1, . . .
, 𝑛𝑜𝑡 𝑏𝑚 with 𝑛 ≥ 0, 𝑚 ≥ 𝑘 ≥ 0, 𝑛 + 𝑚 > 0, where 𝑎1, . . . , 𝑎𝑛, 𝑏1, . . . , 𝑏𝑚 are atoms, or a count expression of the form #𝑐𝑜𝑢𝑛𝑡{𝑙 : 𝑙1, . . . , 𝑙𝑖} ◁▷ 𝑢, where 𝑙 is an atom and 𝑙𝑗 is a literal (i.e., an atom which can be negated or not), 1 ≤ 𝑗 ≤ 𝑖, ◁▷ ∈ {≤, <, =, >, ≥}, and 𝑢 ∈ N. Moreover, “not” denotes default negation. The head of 𝑟 is the set ℎ𝑒𝑎𝑑(𝑟) = {𝑎1, . . . , 𝑎𝑛} and the body of 𝑟 is 𝑏𝑜𝑑𝑦(𝑟) = {𝑏1, . . . , 𝑏𝑘, 𝑛𝑜𝑡 𝑏𝑘+1, . . . , 𝑛𝑜𝑡 𝑏𝑚}. Furthermore, we distinguish between 𝑏𝑜𝑑𝑦+(𝑟) = {𝑏1, . . . , 𝑏𝑘} and 𝑏𝑜𝑑𝑦−(𝑟) = {𝑏𝑘+1, . . . , 𝑏𝑚}. A rule 𝑟 is normal if 𝑛 ≤ 1 and a constraint if 𝑛 = 0. A rule 𝑟 is safe if each variable in 𝑟 occurs in 𝑏𝑜𝑑𝑦+(𝑟). A rule 𝑟 is ground if no variable occurs in 𝑟. A fact is a ground rule with 𝑏𝑜𝑑𝑦(𝑟) = ∅ and |ℎ𝑒𝑎𝑑(𝑟)| = 1. An (input) database is a set of facts. A program is a finite set of rules. For a program Π and an input database 𝐷, we often write Π(𝐷) instead of 𝐷 ∪ Π. If each rule in a program is normal (resp. ground), we call the program normal (resp. ground).

Given a program Π, let 𝑈Π be the set of all constants appearing in Π. 𝐺𝑟(Π) is the set of rules 𝑟𝜎 obtained by applying, to each rule 𝑟 ∈ Π, all possible substitutions 𝜎 from the variables in 𝑟 to elements of 𝑈Π. For count expressions, {𝑙 : 𝑙1, . . . , 𝑙𝑛} denotes the set of all ground instantiations of 𝑙, governed through 𝑙1, . . . , 𝑙𝑛. An interpretation 𝐼 ⊆ 𝐵𝑈 satisfies a ground rule 𝑟 iff ℎ𝑒𝑎𝑑(𝑟) ∩ 𝐼 ≠ ∅ whenever 𝑏𝑜𝑑𝑦+(𝑟) ⊆ 𝐼, 𝑏𝑜𝑑𝑦−(𝑟) ∩ 𝐼 = ∅, and, for each contained count expression, 𝑁 ◁▷ 𝑢 holds, where 𝑁 = |{𝑙 : 𝑙1, . . . , 𝑙𝑛}|, 𝑢 ∈ N and ◁▷ ∈ {≤, <, =, >, ≥}. A ground program Π is satisfied by 𝐼 if 𝐼 satisfies each 𝑟 ∈ Π. A non-ground rule 𝑟 (resp., a program Π) is satisfied by an interpretation 𝐼 iff 𝐼 satisfies all groundings of 𝑟 (resp., 𝐺𝑟(Π)). A subset-minimal set 𝐼 ⊆ 𝐵𝑈 satisfying the Gelfond-Lifschitz reduct Π𝐼 = {ℎ𝑒𝑎𝑑(𝑟) ← 𝑏𝑜𝑑𝑦+(𝑟) | 𝐼 ∩ 𝑏𝑜𝑑𝑦−(𝑟) = ∅, 𝑟 ∈ 𝐺𝑟(Π)} is called an answer set of Π.
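To make the answer-set definition concrete, the following Python sketch (an illustrative brute-force checker written for this paper, not how an ASP solver works internally) enumerates the answer sets of a tiny ground normal program by testing whether each interpretation equals the least model of its Gelfond-Lifschitz reduct; the example program and helper names are our own:

```python
from itertools import combinations

def least_model(definite_rules):
    """Least model of a definite program, computed as a fixpoint."""
    model, changed = set(), True
    while changed:
        changed = False
        for head, pos_body in definite_rules:
            if pos_body <= model and head not in model:
                model.add(head)
                changed = True
    return model

def answer_sets(rules, atoms):
    """Brute-force answer sets of a ground normal program.

    rules: list of (head, positive_body, negative_body) triples.
    """
    found = []
    for r in range(len(atoms) + 1):
        for candidate in combinations(sorted(atoms), r):
            i = set(candidate)
            # Gelfond-Lifschitz reduct w.r.t. i: drop rules blocked by i,
            # then drop the remaining negative body literals.
            reduct = [(h, p) for h, p, n in rules if not (n & i)]
            if least_model(reduct) == i:
                found.append(i)
    return found

# p :- not q.   q :- not p.   r :- p.
prog = [("p", set(), {"q"}), ("q", set(), {"p"}), ("r", {"p"}, set())]
print(answer_sets(prog, {"p", "q", "r"}))  # two answer sets: {q} and {p, r}
```

Real solvers such as clingo avoid this exponential enumeration, of course; the sketch only spells out the semantics stated above.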
We denote the set of answer sets for a program Π by 𝐴𝑆(Π). The tools used in this work are part of the Potassco2 collection [8]. The main tool of the collection is the clingo ASP solver [9].

2.2. Sequential Pattern Mining

Our terminology on sequence mining follows the one in [10]. Throughout this article, [𝑛] = {1, . . . , 𝑛} denotes the set of the first 𝑛 positive integers.

2 https://potassco.org/

Table 1
An example of sequence database 𝒟

Id   Sequence
1    ⟨𝑑 𝑎 𝑏 𝑐⟩
2    ⟨𝑎 𝑐 𝑏 𝑐⟩
3    ⟨𝑎 𝑏 𝑐⟩
4    ⟨𝑎 𝑏 𝑐⟩
5    ⟨𝑎 𝑐⟩
6    ⟨𝑏⟩
7    ⟨𝑐⟩

Let Σ be the alphabet, i.e., the set of items. An itemset 𝐴 = {𝑎1, 𝑎2, . . . , 𝑎𝑚} ⊆ Σ is a finite set of items. The size of 𝐴, denoted |𝐴|, is 𝑚. A sequence 𝑠 is of the form 𝑠 = ⟨𝑠1 𝑠2 . . . 𝑠𝑛⟩ where each 𝑠𝑖 is an itemset, and 𝑛 is the length of the sequence. A database 𝒟 is a multiset of sequences over Σ. A sequence 𝑠 = ⟨𝑠1 . . . 𝑠𝑚⟩ is contained in a sequence 𝑡 = ⟨𝑡1 . . . 𝑡𝑛⟩ with 𝑚 ≤ 𝑛, written 𝑠 ⊑ 𝑡, if there exists an increasing sequence (𝑒1 . . . 𝑒𝑚) of positive integers 𝑒𝑖 ∈ [𝑛], called an embedding of 𝑠 in 𝑡, such that 𝑠𝑖 ⊆ 𝑡𝑒𝑖 for 1 ≤ 𝑖 ≤ 𝑚. For example, we have ⟨𝑎(𝑐𝑑)⟩ ⊑ ⟨𝑎𝑏(𝑐𝑑𝑒)⟩ relative to the embedding (1, 3). Here, (𝑐𝑑) denotes the itemset made of the items 𝑐 and 𝑑. Given a database 𝒟, the cover of a sequence 𝑝 is the set of sequences in 𝒟 that contain 𝑝: 𝑐𝑜𝑣𝑒𝑟(𝑝, 𝒟) = {𝑡 ∈ 𝒟 | 𝑝 ⊑ 𝑡}. The number of sequences in 𝒟 containing 𝑝 is called its support, that is, 𝑠𝑢𝑝𝑝(𝑝, 𝒟) = |𝑐𝑜𝑣𝑒𝑟(𝑝, 𝒟)|. For an integer 𝑡ℎ, the problem of frequent sequence mining is about discovering all sequences 𝑝 such that 𝑠𝑢𝑝𝑝(𝑝, 𝒟) ≥ 𝑡ℎ. We often call 𝑝 a (sequential) pattern, and 𝑡ℎ is also referred to as the (minimum) support threshold. For 𝑡ℎ = 2, the frequent patterns in the database 𝒟 reported in Table 1 are ⟨𝑎⟩, ⟨𝑏⟩, ⟨𝑐⟩, ⟨𝑎 𝑏⟩, ⟨𝑎 𝑐⟩, ⟨𝑏 𝑐⟩ and ⟨𝑎 𝑏 𝑐⟩. The ASP encoding for sequential pattern mining considered in this paper follows the principles outlined in [11] and [4].
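Before turning to the ASP encoding, the definitions of containment and support are easy to check programmatically. The following Python sketch (our illustration, restricted to sequences of single items as in Table 1) decides containment via a greedy leftmost embedding and computes supports over the example database:

```python
def contains(pattern, sequence):
    """True iff an increasing embedding of pattern into sequence exists."""
    pos = 0
    for item in pattern:
        try:
            # greedy leftmost match; sufficient to decide subsequence containment
            pos = sequence.index(item, pos) + 1
        except ValueError:
            return False
    return True

def supp(pattern, database):
    """Support: number of database sequences containing the pattern."""
    return sum(1 for s in database if contains(pattern, s))

# The sequence database D of Table 1
D = [list("dabc"), list("acbc"), list("abc"), list("abc"), list("ac"), ["b"], ["c"]]

assert supp(list("abc"), D) == 4  # <a b c> is contained in sequences 1-4
assert supp(list("ac"), D) == 5
assert supp(["d"], D) == 1        # <d> is not frequent for th = 2
```

For th = 2, enumerating all candidate sequences with these two functions yields exactly the seven frequent patterns listed above.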
In particular, 𝒟 is represented as a collection of ASP facts seq(t, p, e), where the predicate seq states that the item e occurs at position p in the sequence t. Also, there are two parameters to be defined. Besides th, maxlen determines the maximum pattern length. The lower the value of th, the more patterns will be extracted; the lower the value of maxlen, the smaller the ground program will be. The parameters therefore allow some tuning of the program efficiency. Finally, each answer set comprises a single pattern of interest. More precisely, an answer set represents a frequent pattern 𝑠 = ⟨𝑠1 . . . 𝑠𝑚⟩, with 1 ≤ 𝑚 ≤ 𝑚𝑎𝑥𝑙𝑒𝑛, by means of the atoms 𝑝𝑎𝑡(1, 𝑠1), . . . , 𝑝𝑎𝑡(𝑚, 𝑠𝑚). The first argument expresses the position of the item in increasing order, where m can vary, while 1 always indicates the first item in the pattern. For example, the atoms pat(1, a), pat(2, b) and pat(3, c) describe the frequent pattern ⟨𝑎 𝑏 𝑐⟩ of the database in Table 1. For a thorough discussion of the program the reader can refer to [5].

3. The DF case study

As a case study for our application of sequential pattern mining in DF, we have considered a dataset that has been made available by Prof. David Billard (University of Applied Sciences in Geneva) under NDA to DigForASP members for academic experimentation. In the following we provide details of the structure of the dataset and of the pre-processing which is necessary in order to put the dataset into a format more suitable for the application at hand.

3.1. The DigForASP dataset

The dataset consists of the telephone records of four users from a real-world investigative case.
Each file in the dataset has the following schema:

• Type: what kind of operation the user has performed (e.g., incoming/outgoing call or SMS);
• Caller: who makes the call or sends an SMS;
• Callee: who receives the call or SMS;
• Street: where the operation has taken place;
• Time: when the operation has taken place (ISO format3 HH:MM:SS);
• Duration: how long the operation has lasted (ISO format HH:MM:SS);
• Date: when the operation has taken place (format: day, month, year).

The type of the operation is one of the following cases: “config”, “gprs”, “redirect”, “out_sms(SUB_TYPE)”, “in_sms(SUB_TYPE)”, “out_call(SUB_TYPE)”, “in_call(SUB_TYPE)”. Sub-types are: “simple”, “ack”, “foreign”. The dataset has undergone the mandatory anonymization process for reasons of privacy and confidentiality. Therefore it does not contain data that allows tracing back to the real people involved in the investigative case. For instance, there is no phone number for the caller/callee but only a fictitious name. The names and the sizes (# rows) of the four files in the dataset are the following: Eudokia Makrembolitissa (8,783), Karen Cook McNally (20,894), Laila Lalami (12,689), and Lucy Delaney (8,480).

3.2. Data pre-processing

The DigForASP dataset in its original format cannot be considered as a set of sequences. Therefore, it must undergo an intermediate transformation. In short, each line of the original dataset is transformed into an ASP fact through the seq_event atom. The procedure for transforming the original dataset into sequences of ASP facts is the following. Each row of the dataset is transformed into a fact seq_event(t, p, e) (Listing 1), where e represents the item (in our case the event) and p defines the position of e within the sequence t (identified by the date). The term p is important as it defines the order of events within a sequence.
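A minimal Python sketch of this transformation (the row format, field names and helper names are our assumptions for illustration; the actual pre-processing code is not part of the paper) groups the records by date and numbers the events of each day consecutively:

```python
from itertools import count

def to_seq_event_facts(rows):
    """Emit seq_event(t, p, e) fact strings from phone-record rows.

    t is the date (the sequence identifier), p the position of the
    event within that day, e a tuple of event features.
    Hypothetical sketch: rows are dicts following the schema above,
    assumed to be already in chronological order.
    """
    counters, facts = {}, []
    for row in rows:
        t = row["Date"]                             # e.g. (1, 9, 2040)
        p = next(counters.setdefault(t, count(1)))  # positions start at 1 per day
        e = (row["Type"], row["Caller"], row["Callee"],
             row["Street"], row["Time"], row["Duration"])
        facts.append(f"seq_event({t},{p},{e}).")
    return facts

rows = [
    {"Date": (1, 9, 2040), "Time": (0, 12, 9), "Type": "out_call(simple)",
     "Caller": "eudokia", "Callee": "florence",
     "Street": "acheson_boulevard", "Duration": 10},
    {"Date": (1, 9, 2040), "Time": (0, 12, 50), "Type": "out_call(simple)",
     "Caller": "eudokia", "Callee": "florence",
     "Street": "ashcott_street", "Duration": 39},
]
facts = to_seq_event_facts(rows)
```

Note that Python's tuple printing quotes the strings, unlike real ASP terms; the sketch only illustrates the grouping by date and the per-day position counter.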
More specifically, e is made up of the following features: Type, Caller, Callee, Street_a and Street_b for the geo-location of the event, the (hour, minute, seconds) triple, Weekday with range (0 = Monday, ..., 6 = Sunday), and Duration expressed in seconds. Notice that, with reference to the first two facts in Listing 1, the event 𝑒1 is prior to 𝑒2 since (𝑝𝑒1 = 1) < (𝑝𝑒2 = 2). Also, it is possible to transform into sequences only certain days, months or years, so that the analysis can be carried out at different granularity levels.

3 Format to describe dates and times: https://en.wikipedia.org/wiki/ISO_8601

seq_event((1,9,2040),1,(out_call(simple),eudokia_makrembolitissa,florence_violet_mckenzie,acheson_boulevard,acheson_boulevard,(0,12,9),5,10)).
seq_event((1,9,2040),2,(out_call(simple),eudokia_makrembolitissa,florence_violet_mckenzie,acheson_boulevard,ashcott_street,(0,12,50),5,39)).
...
seq_event((2,9,2040),1,(in_sms(simple),annie_dillard,eudokia_makrembolitissa,alder_road,none,(9,22,26),6,0)).
seq_event((2,9,2040),2,(out_call(simple),eudokia_makrembolitissa,irena_jordanova,alexander_muir_road,adenmore_road,(11,55,29),6,82)).
...

Listing 1: Some facts representing sequences of events in the DigForASP dataset.

Additional pre-processing is required to create simpler and easier-to-analyze sequences out of the seq_event atoms in Listing 1. The idea is to create sequences whose identifier refers to a particular day, describing what events happened on that day. For instance, in communication sequences the event e refers to the (Caller, Callee) pair, as shown in Listing 2.

seq((1,9,2040),1,(eudokia_makrembolitissa,florence_violet_mckenzie)).
seq((1,9,2040),2,(eudokia_makrembolitissa,florence_violet_mckenzie)).
...
seq((2,9,2040),1,(annie_dillard,eudokia_makrembolitissa)).
seq((2,9,2040),2,(eudokia_makrembolitissa,irena_jordanova)).
...

Listing 2: Communication sequences generated from the facts in Listing 1.

4.
Our ASP-based Approach

For the purposes of law enforcement investigations, it is especially useful to understand what the extracted patterns are and what information they provide to the analyst. To this aim, we have modified the basic algorithm provided by [5] so as to elaborate patterns whose items have a more complex structure, namely the structure of the DigForASP dataset (see Section 3.2). In Section 4.1, for the sake of self-containment of this paper, we briefly describe the base ASP encoding already presented in [6]. Then we introduce the ASP encoding for mining so-called condensed representations (closed and maximal patterns) in Section 4.2, and finally report the results of a comparative evaluation in Section 4.3.

4.1. Mining Communication Sequences with ASP

Listing 3 reports the base version of our adaptation of [5] to the problem, first introduced in [6], of mining communication sequences. Here, we need to handle items with an internal structure (Lines 1, 11, and 13). Also, it is necessary to figure out in which and in how many daily sequences the patterns are found (Line 21). Patterns are then enriched with additional information such as the type of operation (Type) carried out between the two communicating entities (CC) and the precise time of day (Time) with the relative date (T) (Line 22). Furthermore, since the dataset contains rows with undefined values (indicated with none), two constraints have been added to eliminate all patterns with none (Lines 25-26). Finally, in addition to the maxlen parameter, often used in pattern mining, the minlen parameter has been introduced with a relative constraint to filter out patterns that are not sufficiently long (Line 29).

1 item(I) :- seq(_, _,(I, _, _)).
2
3 % sequential pattern generation
4 patpos(1).
5 { patpos(X+1) } :- patpos(X), X < maxlen.

11
12 % Integrity constraint for maximal patterns
13 :- item(I), X = 1..maxlen+1, { ins(T,X,I) : support(T) } >= th.
14
15 % Integrity constraint for closed patterns
16 :- item(I), X = 1..maxlen+1, { ins(T,X,I) } >= th,
17    ins(T,X,I) : support(T).

Listing 5: ASP encoding for maximal and closed sequential patterns [5].

Listing 5 first describes how to define the set of items that can be inserted between successive items of an embedding (Lines 1-10). These itemsets are encoded by atoms with the predicate ins(T, X, I), where I is an item which can be inserted in an embedding of the current pattern in sequence T between the items at positions X and X+1 in the pattern. Here, only the positions of the last and the first valid occurrences are required for any pattern item. It can be observed that the strategy provides the first valid occurrence of an item X as the first atom of the occ(T, X, _) sequence. Computing the last occurrence for each pattern item can then be done in a similar way by considering an embedding represented in reverse order. Lines 2 to 4 represent occ/3 and rocc/3 (reverse order) occurrences. The computation of insertable items (Lines 7-10) exploits the above remark. Line 7 defines the insertable region in a prefix using rocc(T, 1, P). Since items are insertable if they are strictly before the first position, we consider the value of rocc(T, 1, P+1). Line 8 uses occ(T, L, P) to identify the suffix region. Lines 9-10 combine both constraints for the in-between cases.

Listing 5 also includes the (integrity) constraints for dealing with closed and maximal patterns. To extract only maximal patterns, the constraint at Line 13 denies patterns for which it is possible to insert an item that would be frequent within the sequences supporting the current pattern. The constraint at Line 16 concerns the extraction of closed patterns. It specifies that, for each insertion position (from 1, in the prefix, to maxlen+1, in the suffix), it is not possible to have a frequent insertable item I in every supported sequence.
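The intent of these two constraints can also be stated procedurally. The following Python sketch (a brute-force illustration over single-item sequences, ours, and not the encoding of [5]) first enumerates the frequent patterns of the Table 1 database and then keeps the closed ones (no strict super-pattern with the same support) and the maximal ones (no strict frequent super-pattern):

```python
from itertools import product

def contains(p, s):
    """True iff p is a subsequence of s."""
    pos = 0
    for item in p:
        try:
            pos = s.index(item, pos) + 1
        except ValueError:
            return False
    return True

def supp(p, db):
    return sum(contains(p, s) for s in db)

def frequent(db, th, maxlen):
    alphabet = sorted({i for s in db for i in s})
    return [list(p) for n in range(1, maxlen + 1)
            for p in product(alphabet, repeat=n) if supp(list(p), db) >= th]

def closed(patterns, db):
    # closed: no strict super-pattern with the same support
    return [p for p in patterns
            if not any(len(q) > len(p) and contains(p, q)
                       and supp(q, db) == supp(p, db) for q in patterns)]

def maximal(patterns):
    # maximal: no strict frequent super-pattern
    return [p for p in patterns
            if not any(len(q) > len(p) and contains(p, q) for q in patterns)]

db = [list("dabc"), list("acbc"), list("abc"), list("abc"), list("ac"), ["b"], ["c"]]
freq = frequent(db, th=2, maxlen=5)  # the 7 frequent patterns of Table 1
print(maximal(freq))                 # only <a b c> is maximal
```

On this toy database the seven frequent patterns shrink to four closed and one maximal pattern, mirroring the output reduction measured in Section 4.3.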
Though interesting from a theoretical point of view, these encodings lead to more complex programs which are harder to ground and to solve, as stressed in [5].

4.3. Experiments

The goal of the experiments discussed in this section is the comparative evaluation of the three versions of our ASP encoding, for mining frequent, closed, and maximal sequential patterns respectively. The comparison is done with respect to the number of patterns, the execution time and the memory usage, by varying the key parameters. The experiments can be grouped into two categories:

1. Discovery of frequent, closed and maximal patterns by varying:
• the minimum support threshold from 10% to 50%, while leaving the maximum pattern length fixed to 5;
• the maximum pattern length (1, 3, 5 and 8), while leaving the minimum support threshold fixed to 25%.

2. Scalability tests:
• discovery of frequent patterns by varying the dataset size (100, 1K, 10K), while leaving the minimum support threshold unchanged (set to 25%) and the maximum pattern length fixed to 3.

All the experiments have been conducted over the largest available file of the DigForASP dataset (named Karen Cook McNally), made up of more than 20,000 instances. Given the size, a fairly long execution time for our ASP programs has been assumed. Therefore, the timeout has been set to 5 hours. As a solver, we have used version 5.4.0 of clingo, with default solving parameters. The ASP programs were run on a laptop computer with Windows 10 (with Ubuntu 20.04.4 subsystem), AMD Ryzen 5 3500U @ 2.10 GHz, 8GB RAM, without using the multi-threading mode of clingo. Multi-threading reduces the mean runtime but introduces variance due to the random allocation of tasks. Such variance is inconvenient for interpreting results over repeated executions.

In Table 2 and Figure 2 it is possible to observe how the use of condensed representations, in particular the maximal one, reduces the number of extracted patterns.
Considering the case of frequent patterns, the computation ends before the time limit only for the threshold set to 50%, while for closed and maximal patterns the computation always stopped at the 5-hour time limit. In the former case, the behavior is due to the more restrictive constraint, which makes it possible to extract all the patterns in the dataset, while with lower threshold values there are many more patterns. Vice versa, condensed representations require longer times to extract the patterns even at the same threshold values. As regards the memory usage (Table 2 and Figure 3), one can observe that it decreases as the minimum support threshold increases, and this holds for all three representations. In particular, the extraction of closed patterns uses more memory than the discovery of frequent and maximal patterns.

With a fixed threshold of 25% and varying maximum pattern length (Table 3), we can see a generalized trend in which both the time taken and the memory used increase (Figure 5) as the length of the patterns increases. Notably, the memory usage is higher for closed patterns, as already observed in the previous experiments. Except for the case of frequent patterns, the computation needed to extract patterns with a maximum length of 3 already reaches the time limit. Once again, maximal patterns drastically reduce the size of the output (Figure 4).

For the scalability tests, values of 25% and 3 were chosen for the minimum support threshold and the maximum pattern length, respectively. As can be seen in Table 4, the memory used increases as the size of the dataset increases (Figure 6). The time taken, except for frequent patterns, reaches the computation time limit for condensed representations when the size of the dataset is in the order of tens of thousands of rows (Figure 7).
A general behavior concerns the closed patterns: the memory used is always higher than for the other representations in all the experiments done.

Table 2
Number of frequent (A), closed (B) and maximal (C) patterns generated as the minimum support threshold varies, while leaving the maximum pattern length set to 5.

(A) Frequent
Threshold   # Patterns   Execution Time   Solver Time   Memory usage (MB)
10%         3993         17,999.998s      17,995.37s    698.01
20%         1026         17,999.999s      17,996.94s    652.2
30%         413          17,999.999s      17,996.84s    471.54
40%         77           17,999.999s      17,996.93s    411.93
50%         78           14,819.002s      14,815.91s    306.06

(B) Closed
Threshold   # Patterns   Execution Time   Solver Time   Memory usage (MB)
10%         6691         17,999.998s      17,983.82s    935.66
15%         511          17,999.999s      17,989.97s    814.14
16%         2456         17,999.999s      17,990.00s    804.68
18%         213          17,999.999s      17,989.94s    962.55
20%         524          17,999.999s      17,990.16s    890.42
30%         310          18,000.001s      17,990.39s    767.39
40%         147          17,999.999s      17,989.64s    752.97
50%         11           17,999.999s      17,990.83s    748.29

(C) Maximal
Threshold   # Patterns   Execution Time   Solver Time   Memory usage (MB)
10%         186          17,999.998s      17,989.93s    623.21
15%         148          17,999.999s      17,993.43s    551.35
20%         229          17,999.999s      17,993.61s    568.65
30%         69           17,999.999s      17,993.62s    511.79
40%         42           17,999.999s      17,993.79s    476.52
50%         19           17,999.999s      17,993.72s    469.27

5. Final remarks

Pattern mining problems typically involve a combinatorial explosion. Condensed representations for patterns are among the solutions proposed in the literature to address this issue. In this paper we have reported the results of a comparative evaluation of ASP encodings for solving three variants of the problem of mining sequential patterns, one being the base version (discovery of frequent patterns) and the other two using condensed representations (discovery of closed and maximal patterns). As a case study we have considered the analysis of a real-world dataset of anonymised phone recordings. The results show that condensed representations have pros and cons.
On one hand, they effectively reduce the number of patterns. On the other hand, they tend to consume more computational resources.

Figure 2: Comparison as regards the number of patterns while varying the minimum support threshold (Table 2).

Figure 3: Comparison as regards memory usage while varying the minimum support threshold (Table 2).

For the future we intend to consider the case of localization events, so that evidence can be analysed from a spatio-temporal viewpoint. Furthermore, we expect that the solicited feedback from the DF experts involved in DigForASP will not only validate our work but will also suggest new interesting directions of research.

Table 3
Number of frequent (A), closed (B) and maximal (C) sequential patterns generated as the maximum pattern length varies from 1 to 8, while leaving the minimum support threshold fixed to 25%.

(A) Frequent
Max. length   # Patterns   Execution Time   Solver Time   Memory usage (MB)
1             47           6.127s           4.18s         95.1
3             769          10,562.276s      10,560.14s    306.89
5             200          17,999.999s      17,996.69s    294.23
8             1657         17,999.999s      17,989.19s    533.26

(B) Closed
Max. length   # Patterns   Execution Time   Solver Time   Memory usage (MB)
1             47           9.522s           5.52s         196.74
3             626          17,999.999s      17,994.01s    572.32
5             449          17,999.999s      17,990.46s    916.69
8             1474         17,999.999s      17,984.96s    1157.45

(C) Maximal
Max. length   # Patterns   Execution Time   Solver Time   Memory usage (MB)
1             15           75.903s          73.26s        115.03
3             309          17,999.999s      17,994.35s    394.69
5             207          17,999.999s      17,993.78s    613.96
8             141          17,999.999s      17,990.09s    764.98

Acknowledgments

This article is based upon work from COST Action 17124 “Digital forensics: evidence analysis via intelligent systems and practices (DigForASP)”, supported by COST (European Cooperation in Science and Technology). The work is also partially funded by the University of Bari “Aldo Moro” under the 2017-2018 grant “Metodi di Intelligenza Artificiale per l’Informatica Forense”.

References

[1] S. Costantini, G. De Gasperis, R.
Olivieri, How answer set programming can help in digital forensic investigation, in: D. Ancona, M. Maratea, V. Mascardi (Eds.), Proceedings of the 30th Italian Conference on Computational Logic, Genova, Italy, July 1-3, 2015, volume 1459 of CEUR Workshop Proceedings, CEUR-WS.org, 2015, pp. 53–65. URL: http://ceur-ws.org/Vol-1459/paper29.pdf.
[2] S. Costantini, G. De Gasperis, R. Olivieri, Digital forensics and investigations meet artificial intelligence, Ann. Math. Artif. Intell. 86 (2019) 193–229. URL: https://doi.org/10.1007/s10472-019-09632-y. doi:10.1007/s10472-019-09632-y.
[3] S. Costantini, F. A. Lisi, R. Olivieri, DigForASP: A European cooperation network for logic-based AI in digital forensics, in: A. Casagrande, E. G. Omodeo (Eds.), Proceedings of the 34th Italian Conference on Computational Logic, Trieste, Italy, June 19-21, 2019, volume 2396 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 138–146. URL: http://ceur-ws.org/Vol-2396/paper34.pdf.

Figure 4: Comparison as regards the number of patterns by varying the maximum pattern length (Table 3).

Figure 5: Comparison as regards memory usage by varying the maximum pattern length (Table 3).

[4] M. Gebser, T. Guyet, R. Quiniou, J. Romero, T. Schaub, Knowledge-based sequence mining with ASP, in: IJCAI 2016 - 25th International Joint Conference on Artificial Intelligence, AAAI, 2016, p. 8.
[5] T. Guyet, Y. Moinard, R. Quiniou, T. Schaub, Efficiency analysis of ASP encodings for sequential pattern mining tasks, in: Advances in Knowledge Discovery and Management, Springer, 2018, pp. 41–81.
[6] F. A. Lisi, G. Sterlicchio, Declarative pattern mining in digital forensics: Preliminary results,

Table 4
Number of frequent (A), closed (B) and maximal (C) sequential patterns generated as the dataset size varies (100, 1K, 10K), while leaving the minimum support threshold (25%) and maximum pattern length (3) unchanged.
(A) Frequent
# Rows   # Patterns   Execution Time   Solver Time   Memory usage (MB)   # Sequences
100      15           0.084s           0.03s         14.52               6
1000     9821         81.993s          81.83s        72.81               9
10000    51           2978.229s        2976.73s      170.2               93

(B) Closed
# Rows   # Patterns   Execution Time   Solver Time   Memory usage (MB)   # Sequences
100      7            0.091s           0.03s         15.73               6
1000     171          13.446s          13.22s        36.50               9
10000    625          18,000.012s      17,997.07s    281.72              93

(C) Maximal
# Rows   # Patterns   Execution Time   Solver Time   Memory usage (MB)   # Sequences
100      6            0.099s           0.02s         15.33               6
1000     32           10.928s          10.60s        30.75               9
10000    153          18,000.001s      17,997.71s    265.33              93

in: R. Calegari, G. Ciatto, A. Omicini (Eds.), Proceedings of the 37th Italian Conference on Computational Logic, Bologna, Italy, June 29 - July 1, 2022, volume 3204 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 232–246. URL: http://ceur-ws.org/Vol-3204/paper_23.pdf.
[7] G. Brewka, T. Eiter, M. Truszczynski, Answer set programming at a glance, Communications of the ACM 54 (2011) 92–103. URL: http://doi.acm.org/10.1145/2043174.2043195. doi:10.1145/2043174.2043195.
[8] M. Gebser, B. Kaufmann, R. Kaminski, M. Ostrowski, T. Schaub, M. Schneider, Potassco: The Potsdam answer set solving collection, AI Communications 24 (2011) 107–124.
[9] M. Gebser, R. Kaminski, B. Kaufmann, T. Schaub, Clingo = ASP + control: Preliminary report, arXiv preprint arXiv:1405.3694 (2014).
[10] B. Negrevergne, T. Guns, Constraint-based sequence mining using constraint programming, in: International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research, Springer, 2015, pp. 288–305.
[11] M. Järvisalo, Itemset mining as a challenge application for answer set enumeration, in: International Conference on Logic Programming and Nonmonotonic Reasoning, Springer, 2011, pp. 304–310.

Figure 6: Comparison as regards memory usage at the variation of the dataset size (Table 4).

Figure 7: Comparison as regards execution time at the variation of the dataset size (Table 4).