Learning binary classification rules for sequential data

Marine Collery 1,2,*, Remy Kusters 2,3
1 IBM France Lab, Orsay, France
2 Inria Saclay Ile-de-France, Palaiseau, France
3 IBM Research, Orsay, France
* Corresponding author: marine.collery@ibm.com (M. Collery)

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Discovering patterns for classification of sequential data is of key importance for a variety of fields, ranging from genomics to fraud detection. In this short paper, we propose a differentiable method to discover both local and global patterns for rule-based binary classification. Key to this end-to-end differentiable approach is that the patterns used in the rules are learned alongside the rules themselves.

Sequence classification using rules composed of "classification relevant patterns" is a problem that has received considerable attention in statistical machine learning and data mining due to its applications in e.g. speech processing, fraud detection or genomics. There exists a wide literature that combines (un)supervised pattern mining techniques with sequence classification. Sequential pattern mining has focused to a great extent on mining frequent symbolic subsequences [1]. In feature-based classification, there are two approaches for capturing the sequential nature of the features available for classification: either through preprocessing or by learning them simultaneously with the classification task itself. The present work extends an existing literature that learns classification rules over sequential data [2, 3] with a differentiable approach that builds on top of similar methods for binary tabular data [4, 5]. The novelty of our work lies in using learned patterns as atoms in a rule-based classifier for sequential data.

In this paper, we propose a differentiable rule learning classification model for sequential data where the conditions are composed of sequence-dependent patterns that are discovered alongside the classification task itself. More precisely, we aim at learning a rule of the following structure: if pattern then class = 1 else class = 0. In particular, we consider two types of patterns: local and global patterns, as introduced in [6]. A local pattern describes a subsequence at a specific position in the sequence, while a global pattern is invariant to the location in the sequence (see Figure 1 for an example).

Model
The base rule model (Figure 1) we invoke is composed of two consecutive layers that respectively mimic logical AND and OR operators (inspired by the rule learning modules in [4, 5]). The AND layer takes binary features (which are atomic boolean formulae) as input and its output is used as input by the OR layer. The output of the OR layer is mapped to the classification label 𝑦𝑖. These two layers are defined with binary weights that select the nodes which are included in the respective boolean expression (conjunction or disjunction). In other words, this network implements the evaluation of a DNF. This model has a direct equivalence with a binary classification rule like if (𝐴 ∧ 𝐵) ∨ 𝐶 then class = 1 else class = 0, where 𝐴, 𝐵 and 𝐶 are binary input features (atoms in logical terms).¹

¹ A natural extension of this architecture for sequential data would be to extend this base rule model with an explicit recursion of the base rule model, similar to an RNN. This approach was tested but faced the same limitations as classical RNNs, i.e., vanishing gradients and the capture of only short-term dependencies.

In this paper, we apply the base rule model as a 1D-convolutional window of fixed length over a sequence and retrieve all outputs as input for an additional disjunctive layer which we refer to as the Conv-OR layer. The base rule model learns a boolean expression over the window size length and the Conv-OR layer indicates where along the sequence that logical expression is valid. If the evaluation of the logical expression is valid all along the sequence, then it can be described as a global pattern; otherwise the learned pattern represents a local pattern.
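To make the evaluation semantics concrete, the following minimal sketch (our own illustration, not the paper's implementation) evaluates a base rule model with already binarised weights and slides it over a sequence together with a Conv-OR layer. Array names such as and_w, or_w and conv_or_w are illustrative placeholders, and the differentiable relaxation used during training is omitted.

```python
import numpy as np

def base_rule_model(x, and_w, or_w):
    """Evaluate a DNF over a flattened binary window x.
    x:     (window * n_features,) binary input
    and_w: (n_conjunctions, window * n_features) binary weights selecting atoms
    or_w:  (n_conjunctions,) binary weights selecting conjunctions
    """
    conj = np.all(x >= and_w, axis=1)       # AND layer: all selected atoms must be 1
    return np.any(conj & (or_w == 1))       # OR layer: any selected conjunction suffices

def conv_rule_model(seq, and_w, or_w, conv_or_w, window):
    """Slide the base rule model over a (seq_len, n_features) binary sequence
    and aggregate the window outputs with the Conv-OR weights."""
    outs = []
    for t in range(seq.shape[0] - window + 1):
        outs.append(base_rule_model(seq[t:t + window].reshape(-1), and_w, or_w))
    outs = np.array(outs)
    # Conv-OR layer: label 1 iff the expression holds at a selected window position
    return int(np.any(outs & (conv_or_w == 1)))
```

With all Conv-OR weights activated this reduces to checking the expression anywhere in the sequence (a global pattern); with only some weights activated, only the corresponding positions are checked (a local pattern).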
Figure 1: Architecture of the model with an example on a sequence of letters. The base rule model is applied as a 1D-convolutional window over the sequence. The resulting boolean values are given as input to the Conv-OR layer, which indicates through its activated weights where along the sequence the expression learned by the base model is true. The output of the Conv-OR layer is mapped to the label of the sequence 𝑦𝑖. For local patterns, the base model expression needs to be shifted according to the Conv-OR layer weights.

Expressivity
With this approach, different sequence-dependent expressions can be extracted and their nature depends on the learned weights of the Conv-OR layer (Figure 1).

• If all the weights of the Conv-OR layer are activated (i.e. equal to 1), the logical expression learned by the base model is valid over the whole sequence: a global pattern is learned.

• If only some of the weights of the Conv-OR layer are activated, the logical expression learned by the base model is valid only in the windows associated with those weights: a local pattern is learned. The base model logical expression is modified accordingly to match that shift (see the example in Figure 1 with a shift of 3 sequential steps).

The obtained weights thus translate to rules with the following grammar:

rule → if expr then class = 1 else class = 0
expr → local pattern | global pattern | sequence property

We introduce 𝑡, the position at which the last observation in a sequence was made. With 𝑡 being our reference, in a sequence of size 𝑁 ∈ ℕ, 𝑡 − 𝑖 refers to the moment of the 𝑖-th observation before 𝑡 (0 ≤ 𝑖 ≤ 𝑁 − 1). 𝐴, 𝐵, 𝐶 and 𝐷 are toy binary input features; for simplicity of the example, these features cannot be activated simultaneously at the same position 𝑡 in the sequence, but the method remains valid for complex symbolic sequences [7].

With those definitions, we list below examples of different sequence-dependent atoms that can be expressed with the proposed architecture (see Figure 1):

local pattern is a boolean expression composed of atoms that are TRUE at a specific position, for example A at t-15.

global pattern is an expression describing the presence of a pattern anywhere in the sequence, for example B-D in sequence, where the "−" sign refers to "followed by" in global patterns.

global pattern over window, for example B-*-D in window [t-6; t-3].

condition on sequence property is a condition on the sequence length, for example 4 ≤ length of sequence ≤ 6 (not shown in the figure; it corresponds to a specific case where the base model has learned an empty rule).

Those expressions could then be used as input to another AND/OR layer model, for instance, for extending the rule complexity.
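As an illustration of how learned weights translate into this grammar (a sketch under our own indexing conventions, not the paper's code), binarised Conv-OR weights can be read off into a rule string; here `atoms` stands for the textual form of the expression learned by the base model over one window.

```python
def weights_to_rule(conv_or_w, atoms, window, seq_len):
    """Turn binarised Conv-OR weights into a rule string in the grammar above."""
    active = [i for i, w in enumerate(conv_or_w) if w == 1]
    if len(active) == len(conv_or_w):
        # every window position is selected: global pattern
        return f"if {atoms} in sequence then class = 1 else class = 0"
    # local pattern: window i covers positions t-(seq_len-1-i) .. t-(seq_len-window-i)
    spans = [f"[t-{seq_len - 1 - i}; t-{seq_len - window - i}]" for i in active]
    return ("if " + " or ".join(f"{atoms} in window {s}" for s in spans)
            + " then class = 1 else class = 0")
```

For example, with atoms = "B-*-D", window = 4, seq_len = 16 (values chosen purely for illustration) and only the weight at window position 9 activated, this yields if B-*-D in window [t-6; t-3] then class = 1 else class = 0, matching the form of the example above.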
Sparsity Requirements
In order to learn those expressions, especially the global ones, the model needs to generalize without observing all possible instances at training time. The first requirement for that matter is sparsity in the base model. The approach taken follows a sparsify-during-training method [8] and dynamically enforces sparsity in the weights from 0% to 100% [9]. The model with the highest prediction accuracy on the validation dataset and the highest sparsity is kept.
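As a rough illustration of such a schedule (a simplified sketch of the general sparsify-during-training idea, not the exact procedure of [8, 9]), the target sparsity can be ramped from 0% to 100% over training while, at each step, the smallest-magnitude weights are masked out; the checkpoint with the best validation accuracy at the highest sparsity is then retained.

```python
import numpy as np

def target_sparsity(epoch, n_epochs):
    """Ramp the fraction of pruned weights linearly from 0 to 1 over training."""
    return min(1.0, epoch / n_epochs)

def prune_mask(weights, sparsity):
    """Boolean mask keeping only the largest-magnitude (1 - sparsity) fraction."""
    k = int(np.floor(sparsity * weights.size))
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.abs(weights) > threshold
```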
Preliminary results
Experiments with synthetic toy examples show that ground-truth patterns such as those expressed in Figure 1 can be discovered from moderately small sample sizes (see Appendix A for details). Variations and extensions of these patterns have shown promising results, and further tests need to be pursued to scope the range of discoverable patterns with this approach.

Limitations
There are limitations to this architecture, the main one being that the window size is fixed, which limits the size of the patterns that can be found. There is a trade-off between the maximum size of patterns and the training complexity that has to be further investigated.

Conclusion
To conclude, we presented a 1D-convolutional neural architecture to discover local and global patterns in sequences while learning binary classification rules. This architecture is fully differentiable and requires sparsity that is enforced dynamically. Its main limitation is its dependence on the window size parameter. Further work will consist in integrating this block into more complex architectures to augment the expressivity of the learned rules. Moreover, the algorithm will be tested on concrete datasets such as the UCI splice dataset or the E. coli promoter gene sequences dataset [10] to demonstrate its ability to discover rules with learned non-trivial patterns.

Acknowledgments
This work has been partially funded by the French government as part of project PSPC AIDA 2019-PSPC-09, in the framework of the "Programme d'Investissement d'Avenir".

References
[1] J. Han, M. Kamber, J. Pei, 13 - Data Mining Trends and Research Frontiers, in: J. Han, M. Kamber, J. Pei (Eds.), Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Boston, 2012, pp. 585–631. URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000137. doi:10.1016/B978-0-12-381479-1.00013-7.
[2] C. Zhou, B. Cule, B. Goethals, Pattern Based Sequence Classification, IEEE Transactions on Knowledge and Data Engineering 28 (2015) 1–1. doi:10.1109/TKDE.2015.2510010.
[3] E. Egho, D. Gay, R. Trinquart, M. Boullé, N. Voisine, F. Clérot, MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework, in: L. Bellatreche, S. Chakravarthy (Eds.), Big Data Analytics and Knowledge Discovery, volume 10440 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 105–119. URL: http://link.springer.com/10.1007/978-3-319-64283-3_8. doi:10.1007/978-3-319-64283-3_8.
[4] L. Qiao, W. Wang, B. Lin, Learning Accurate and Interpretable Decision Rule Sets from Neural Networks, arXiv:2103.02826 [cs] (2021). URL: http://arxiv.org/abs/2103.02826.
[5] R. Kusters, Y. Kim, M. Collery, C. d. S. Marie, S. Gupta, Differentiable Rule Induction with Learned Relational Features, arXiv:2201.06515 [cs, stat] (2022). URL: http://arxiv.org/abs/2201.06515.
[6] C. C. Aggarwal, On effective classification of strings with wavelets, in: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, Association for Computing Machinery, New York, NY, USA, 2002, pp. 163–172. URL: https://doi.org/10.1145/775047.775071. doi:10.1145/775047.775071.
[7] Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, ACM SIGKDD Explorations Newsletter 12 (2010) 40–48. URL: https://dl.acm.org/doi/10.1145/1882471.1882478. doi:10.1145/1882471.1882478.
[8] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, A. Peste, Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, arXiv:2102.00554 [cs] (2021). URL: http://arxiv.org/abs/2102.00554.
[9] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, M. Jaggi, Dynamic Model Pruning with Feedback, arXiv:2006.07253 [cs, stat] (2020). URL: http://arxiv.org/abs/2006.07253.
[10] D. Dua, C. Graff, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. URL: http://archive.ics.uci.edu/ml.

A. Experiments with toy examples
The model is tested on toy synthetic datasets that are generated to fit the rules presented in Figure 3. Datasets are composed of 10000 sequences, of which 10% are used for validation and another 10% for testing, the rest being used for training (batches of 100). We used the Adam optimizer with a learning rate set to 0.1. In this setup, we obtain 100% accuracy in 8 out of 10 trainings of 200 epochs.
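For reference, a sketch of the kind of toy data described above is given below (our reconstruction; the exact generation procedure, alphabet and ground-truth pattern are assumptions made for illustration). Sequences of symbols are labelled 1 iff they contain a ground-truth global pattern, here "B followed later by D".

```python
import numpy as np

rng = np.random.default_rng(0)
SYMBOLS = np.array(list("ABCD"))

def make_sequence(length=20):
    """Draw a random symbolic sequence."""
    return rng.choice(SYMBOLS, size=length)

def label(seq):
    """Ground-truth rule: class 1 iff a B occurs somewhere before a D."""
    b_pos = np.flatnonzero(seq == "B")
    d_pos = np.flatnonzero(seq == "D")
    return int(b_pos.size > 0 and d_pos.size > 0 and d_pos.max() > b_pos.min())

sequences = [make_sequence() for _ in range(10_000)]
labels = [label(s) for s in sequences]

# 80% train / 10% validation / 10% test split, as in the appendix
n = len(sequences)
train = sequences[:int(0.8 * n)]
val = sequences[int(0.8 * n):int(0.9 * n)]
test = sequences[int(0.9 * n):]
```

Training would then proceed on one-hot encoded symbols with mini-batches of 100 and the Adam optimizer (learning rate 0.1), as described above.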