<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning binary classification rules for sequential data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marine Collery</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Remy Kusters</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM France Lab</institution>
          ,
          <addr-line>Orsay</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research</institution>
          ,
          <addr-line>Orsay</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Inria Saclay Ile-de-France</institution>
          ,
          <addr-line>Palaiseau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Discovering patterns for classification of sequential data is of key importance for a variety of fields, ranging from genomics to fraud detection. In this short paper, we propose a differentiable method to discover both local and global patterns for rule-based binary classification. Key to this end-to-end differentiable approach is that the patterns used in the rules are learned alongside the rules themselves.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title />
      <p>
        Sequence classification using rules composed of
“classification-relevant patterns” is a problem that has received
considerable attention in statistical machine learning and
data mining due to its applications in e.g. speech
processing, fraud detection or genomics. There exists a wide
literature that combines (un)supervised pattern mining
techniques with sequence classification. Sequential
pattern mining has focused to a great extent on mining
frequent symbolic subsequences [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In feature-based
classification, there are two approaches for capturing the
sequential nature of the features available for classification:
the sequential features are either built in a preprocessing step
or learned simultaneously
with the classification task itself. The present work
extends an existing literature that learns classification rules
over sequential data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] with a differentiable approach
that builds on top of similar methods for binary tabular
data [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The novelty of our work lies in using learned
patterns as atoms in a rule-based classifier for sequential
data.
      </p>
      <p>In this paper, we propose a differentiable rule learning
classification model for sequential data where the
conditions are composed of sequence-dependent patterns
that are discovered alongside the classification task itself.</p>
      <p>
        More precisely, we aim at learning a rule of the following
structure: if pattern then class = 1 else class = 0. In
particular, we consider two types of patterns, local and global
patterns, as introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A local pattern describes a
subsequence at a specific position in the sequence, while a
global pattern is invariant to its location in the sequence
(see Figure 1 for an example).
      </p>
      <p>
        <bold>Model</bold> The base rule model (Figure 1) we invoke is
composed of two consecutive layers that respectively
mimic logical AND and OR operators (inspired by the
rule learning modules in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]). The AND layer takes
binary features (which are atomic boolean formulae) as
input, and its output is used as input by the OR layer. The
output of the OR layer is mapped to the classification
label. These two layers are defined with binary weights
that select the nodes which are included in the respective
boolean expression (conjunction or disjunction). In other
words, this network implements the evaluation of a DNF.
      </p>
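      <p>For concreteness, the following is a minimal sketch of such an AND/OR rule network, assuming a standard product relaxation of the boolean operators and sigmoid-relaxed binary selection weights; the class and parameter names are our own illustration, not the exact implementation of [4, 5]:</p>
      <preformat>
import torch
import torch.nn as nn

class SoftAnd(nn.Module):
    """Conjunction layer: output j ≈ AND over the inputs selected by w[:, j]."""
    def __init__(self, in_features, out_features):
        super().__init__()
        # Real-valued logits, pushed towards hard {0, 1} selections during training.
        self.weight = nn.Parameter(torch.randn(in_features, out_features))

    def forward(self, x):  # x: (batch, in_features), entries in [0, 1]
        w = torch.sigmoid(self.weight)  # soft selection mask
        # A selected input (w ≈ 1) must be true; unselected inputs are ignored:
        # AND_j = prod_i (1 - w_ij (1 - x_i))
        return torch.prod(1 - w.unsqueeze(0) * (1 - x.unsqueeze(-1)), dim=1)

class SoftOr(nn.Module):
    """Disjunction layer: output j ≈ OR over the inputs selected by w[:, j]."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(in_features, out_features))

    def forward(self, x):
        w = torch.sigmoid(self.weight)
        # OR_j = 1 - prod_i (1 - w_ij x_i)
        return 1 - torch.prod(1 - w.unsqueeze(0) * x.unsqueeze(-1), dim=1)

class BaseRuleModel(nn.Module):
    """AND layer feeding an OR layer: evaluates a DNF over binary atoms."""
    def __init__(self, n_atoms, n_conjunctions):
        super().__init__()
        self.and_layer = SoftAnd(n_atoms, n_conjunctions)
        self.or_layer = SoftOr(n_conjunctions, 1)

    def forward(self, x):  # x: (batch, n_atoms) with entries in {0, 1}
        return self.or_layer(self.and_layer(x))  # (batch, 1), in [0, 1]
      </preformat>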
      <p>This model has a direct equivalence with a binary
classification rule like if (x1 ∧ x2) ∨ x3 then class = 1
else class = 0, where x1, x2 and x3 are binary input
features (atoms in logical terms)<sup>1</sup>.</p>
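      <p>As a usage illustration of the sketch above, the selection weights can be set by hand (large logits approximate hard 0/1 selections) to realize this example rule:</p>
      <preformat>
# Hand-set weights (a hypothetical illustration) realizing
# "if (x1 AND x2) OR x3 then class = 1 else class = 0".
model = BaseRuleModel(n_atoms=3, n_conjunctions=2)
with torch.no_grad():
    # conjunction 0 selects x1 and x2; conjunction 1 selects x3
    model.and_layer.weight.copy_(torch.tensor([[ 10., -10.],
                                               [ 10., -10.],
                                               [-10.,  10.]]))
    model.or_layer.weight.fill_(10.)  # the OR layer selects both conjunctions

x = torch.tensor([[1., 1., 0.],    # (x1 AND x2) holds  -> class 1
                  [0., 0., 1.],    # x3 holds           -> class 1
                  [0., 1., 0.]])   # neither holds      -> class 0
print(model(x).round())  # tensor([[1.], [1.], [0.]])
      </preformat>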
      <p>In this paper, we apply the base rule model as a
1D convolutional window of fixed length over a sequence
and retrieve all outputs as input for an additional
disjunctive layer, which we refer to as the Conv-OR layer. The
base rule model learns a boolean expression over the
window-size length, and the Conv-OR layer indicates where
along the sequence that logical expression is valid. If
the logical expression is valid all along
the sequence, it can be described as a global pattern;
otherwise, the learned pattern represents a local pattern.</p>
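      <p>A sketch of this convolutional application, assuming stride 1 and sequences of a fixed maximum length, could look as follows (it reuses the BaseRuleModel and SoftOr sketches above):</p>
      <preformat>
class ConvRuleModel(nn.Module):
    """Base rule model slid over the sequence, aggregated by a Conv-OR layer."""
    def __init__(self, n_symbols, window, max_len, n_conjunctions=4):
        super().__init__()
        # Atoms seen by one window: one binary feature per (position, symbol).
        self.base = BaseRuleModel(window * n_symbols, n_conjunctions)
        self.window = window
        # Conv-OR layer: disjunction over all window positions.
        self.conv_or = SoftOr(max_len - window + 1, 1)

    def forward(self, x):  # x: (batch, max_len, n_symbols), binary
        per_window = torch.cat(
            [self.base(x[:, s:s + self.window, :].flatten(1))
             for s in range(x.shape[1] - self.window + 1)],
            dim=1)                        # (batch, n_windows)
        return self.conv_or(per_window)   # (batch, 1) classification score
      </preformat>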
      <p><bold>Expressivity</bold> With this approach, different
sequence-dependent expressions can be extracted, and their nature
depends on the learned weights of the Conv-OR layer
(Figure 1).</p>
      <p>• If all the weights of the Conv-OR layer are
activated (i.e. equal to 1), the logical expression
learned by the base model is valid along the whole
sequence: a global pattern is learned.
• If only some of the weights of the Conv-OR layer
are activated, the logical expression learned by
the base model is valid only in the windows
associated with those weights: a local pattern is learned. The
base model logical expression is modified
accordingly to match that shift (see example in Figure 1
with a shift of 3 sequential steps).</p>
      <p><sup>1</sup> A natural extension of this architecture for sequential data would
be to extend this base rule model with an explicit recursion of the
base rule model, similar to an RNN. This approach was tested but
faced the same limitations as classical RNNs, i.e., vanishing
gradients and capturing only short-term dependencies.</p>
      <p>The obtained weights thus translate to rules with the
following grammar:</p>
      <preformat>
rule → if expr then class = 1 else class = 0
expr → local pattern | global pattern | sequence prop
      </preformat>
      <p>
        We introduce t, the position at which the last observation
in a sequence was made. With t being our reference, in a
sequence of size n ∈ ℕ, t − i refers to the moment of the
i-th observation before t (0 ≤ i ≤ n − 1). A, B, C and
D are toy binary input features; for simplicity of the
example, those features cannot be activated simultaneously
at the same position in the sequence, but the method is
still valid for complex symbolic sequences [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
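      <p>As an illustration of this toy setting, symbolic sequences over {A, B, C, D} can be encoded into the binary inputs consumed by the sketches above; the encoding below is our own assumption, with one one-hot atom per (position, symbol):</p>
      <preformat>
# Hypothetical toy encoding (our assumption, not from the paper): one
# one-hot binary atom per (position, symbol), exactly one symbol active
# per position, shorter sequences zero-padded to the maximum length.
ALPHABET = "ABCD"

def encode(seq, max_len):
    x = torch.zeros(max_len, len(ALPHABET))
    for pos, sym in enumerate(seq):
        x[pos, ALPHABET.index(sym)] = 1.0
    return x

batch = torch.stack([encode("ABCA", 6), encode("BDAC", 6)])  # (2, 6, 4)
model = ConvRuleModel(n_symbols=4, window=3, max_len=6)
print(model(batch).shape)  # torch.Size([2, 1])
      </preformat>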
      <p>
        With those definitions, we list below examples of
the different sequence-dependent atoms that can be expressed
with the proposed architecture (see Figure 1):
local pattern is a boolean expression composed of
atoms that are TRUE at a specific position, for
example A at t-15;
global pattern is an expression describing the
presence of a pattern anywhere in the sequence, for
example B-D in sequence, where the “−” sign
refers to “followed by” in global patterns;
global pattern over window restricts a global pattern
to a portion of the sequence, for
example B-*-D in window [t-6; t-3];
condition on sequence property is a condition on
the sequence, for example 4 ≤ length
of sequence ≤ 6 (not shown in the figure; it
corresponds to a specific case where the base
model has learned an empty rule).
Those expressions could then be used as input to another
AND/OR layer model, for instance, to extend the rule
complexity.
      </p>
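      <p>The translation from learned weights to such rules can be sketched as follows; the helper below is hypothetical and assumes the weights have been binarized after sparse training (atom positions are printed relative to the window start w, from which the t − i notation can be recovered given the sequence length):</p>
      <preformat>
def decode_rule(model, threshold=0.5):
    """Hypothetical helper: pretty-print the rule encoded by a trained
    ConvRuleModel whose selection weights are (close to) binary."""
    and_w = torch.sigmoid(model.base.and_layer.weight) > threshold    # (atoms, conjs)
    or_w = torch.sigmoid(model.base.or_layer.weight) > threshold      # (conjs, 1)
    conv_w = (torch.sigmoid(model.conv_or.weight) > threshold)[:, 0]  # (windows,)

    n_symbols = len(ALPHABET)
    conjunctions = []
    for j in range(and_w.shape[1]):
        if not or_w[j, 0]:
            continue  # this conjunction is not selected by the OR layer
        atoms = [f"{ALPHABET[k % n_symbols]} at w+{k // n_symbols}"
                 for k in range(and_w.shape[0]) if and_w[k, j]]
        if atoms:
            conjunctions.append("(" + " AND ".join(atoms) + ")")
    body = " OR ".join(conjunctions) if conjunctions else "TRUE"  # empty rule

    if bool(conv_w.all()):
        where = "anywhere in the sequence"        # global pattern
    else:
        starts = conv_w.nonzero()[:, 0].tolist()  # local pattern
        where = f"at window start(s) w in {starts}"
    return f"if {body} holds {where} then class = 1 else class = 0"

print(decode_rule(model))
      </preformat>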
      <p>
        <bold>Sparsity Requirements</bold> In order to learn those
expressions, especially the global ones, the model needs
to generalize without observing all possible instances at
training time. The first requirement for that matter is
sparsity in the base model. The approach taken follows
a sparsify-during-training method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and dynamically
enforces sparsity in the weights, from 0% to 100% [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
model with the highest prediction accuracy on the validation
dataset and the highest sparsity is kept.
      </p>
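      <p>A minimal sketch of this sparsification loop, assuming simple magnitude pruning with a ramped target fraction (the pruning-with-feedback scheme of [9] differs in its details), is given below:</p>
      <preformat>
def apply_sparsity(layer, fraction):
    """Push the `fraction` smallest-magnitude selection logits to 'off'."""
    with torch.no_grad():
        w = layer.weight
        k = int(fraction * w.numel())
        if k > 0:
            cutoff = w.abs().flatten().kthvalue(k).values
            # sigmoid(-20) ≈ 2e-9: the corresponding input is dropped
            w[w.abs() <= cutoff] = -20.0

def sparsify_step(model, fraction):
    for layer in (model.base.and_layer, model.base.or_layer, model.conv_or):
        apply_sparsity(layer, fraction)

# Inside the training loop, ramp the sparsity from 0% towards 100%, e.g.
# sparsify_step(model, epoch / num_epochs), and keep the checkpoint with
# the highest validation accuracy at the highest sparsity.
      </preformat>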
      <p><bold>Preliminary results</bold> In experiments with synthetic toy
examples, ground-truth patterns such as those
expressed in Figure 1 can be discovered from moderately
small sample sizes (see Appendix A for details).
Variations and extensions of these patterns have shown
promising results, and further tests need to be pursued to scope
the range of patterns discoverable with this approach.</p>
      <p><bold>Limitations</bold> There are limitations to this architecture.
The main one is that the window size is fixed, which
limits the size of the patterns that can be found. There is
a trade-off between the maximum size of patterns and the
training complexity that has to be further investigated.</p>
      <p>
        <bold>Conclusion</bold> To conclude, we presented a
1D convolutional neural architecture to discover local and
global patterns in sequences while learning binary
classification rules. This architecture is fully differentiable
and requires sparsity that is enforced dynamically. Its
main limitation is its dependence on the window-size
parameter. Further work will consist in integrating
this block into more complex architectures to augment
the expressivity of the learned rules. Moreover, the
algorithm will be tested on concrete datasets such as
the UCI splice dataset or the E. coli promoter gene sequences
dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to demonstrate its ability to discover rules
with learned non-trivial patterns.
      </p>
      <p><bold>Acknowledgments</bold>
This work has been partially funded by the French
government as part of project PSPC AIDA 2019-PSPC-09,
in the framework of the “Programme d’Investissement
d’Avenir”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] J. Han, M. Kamber, J. Pei, 13 - Data Mining Trends and Research Frontiers, in: J. Han, M. Kamber, J. Pei (Eds.), Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Boston, 2012, pp. 585-631. URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000137. doi:10.1016/B978-0-12-381479-1.00013-7.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Zhou, B. Cule, B. Goethals, Pattern Based Sequence Classification, IEEE Transactions on Knowledge and Data Engineering 28 (2015) 1-1. doi:10.1109/TKDE.2015.2510010.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] E. Egho, D. Gay, R. Trinquart, M. Boullé, N. Voisine, F. Clérot, MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework, in: L. Bellatreche, S. Chakravarthy (Eds.), Big Data Analytics and Knowledge Discovery, volume 10440, Springer International Publishing, Cham, 2017, pp. 105-119. URL: http://link.springer.com/10.1007/978-3-319-64283-3_8. doi:10.1007/978-3-319-64283-3_8. Series Title: Lecture Notes in Computer Science.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Qiao, W. Wang, B. Lin, Learning Accurate and Interpretable Decision Rule Sets from Neural Networks, arXiv:2103.02826 [cs] (2021). URL: http://arxiv.org/abs/2103.02826.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. Kusters, Y. Kim, M. Collery, C. d. S. Marie, S. Gupta, Differentiable Rule Induction with Learned Relational Features, arXiv:2201.06515 [cs, stat] (2022). URL: http://arxiv.org/abs/2201.06515.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] C. C. Aggarwal, On effective classification of strings with wavelets, in: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, Association for Computing Machinery, New York, NY, USA, 2002, pp. 163-172. URL: https://doi.org/10.1145/775047.775071. doi:10.1145/775047.775071.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, ACM SIGKDD Explorations Newsletter 12 (2010) 40-48. URL: https://dl.acm.org/doi/10.1145/1882471.1882478. doi:10.1145/1882471.1882478.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, A. Peste, Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, arXiv:2102.00554 [cs] (2021). URL: http://arxiv.org/abs/2102.00554.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, M. Jaggi, Dynamic Model Pruning with Feedback, arXiv:2006.07253 [cs, stat] (2020). URL: http://arxiv.org/abs/2006.07253.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] D. Dua, C. Graff, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. URL: http://archive.ics.uci.edu/ml.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>