Learning binary classification rules for sequential data

Marine Collery 1,2,*, Remy Kusters 2,3
1 IBM France Lab, Orsay, France
2 Inria Saclay Ile-de-France, Palaiseau, France
3 IBM Research, Orsay, France
* Corresponding author: marine.collery@ibm.com (M. Collery)

STRL'22: First International Workshop on Spatio-Temporal Reasoning and Learning, July 24, 2022, Vienna, Austria. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

Abstract
Discovering patterns for classification of sequential data is of key importance for a variety of fields, ranging from genomics to fraud detection. In this short paper, we propose a differentiable method to discover both local and global patterns for rule-based binary classification. Key to this end-to-end differentiable approach is that the patterns used in the rules are learned alongside the rules themselves.

Sequence classification using rules composed of "classification relevant patterns" is a problem that has received considerable attention in statistical machine learning and data mining due to its applications in e.g. speech processing, fraud detection or genomics. There exists a wide literature that combines (un)supervised pattern mining techniques with sequence classification. Sequential pattern mining has focused to a great extent on mining frequent symbolic subsequences [1]. In feature-based classification, there are two approaches for capturing the sequential nature of the features available for classification: either through preprocessing or by learning them simultaneously with the classification task itself. The present work extends an existing literature that learns classification rules over sequential data [2, 3] with a differentiable approach that builds on top of similar methods for binary tabular data [4, 5]. The novelty of our work lies in using learned patterns as atoms in a rule-based classifier for sequential data.

In this paper, we propose a differentiable rule learning classification model for sequential data where the conditions are composed of sequence-dependent patterns that are discovered alongside the classification task itself. More precisely, we aim at learning a rule of the following structure: if pattern then class = 1 else class = 0. In particular, we consider two types of patterns: local and global patterns, as introduced in [6]. A local pattern describes a subsequence at a specific position in the sequence, while a global pattern is invariant to the location in the sequence (see Figure 1 for an example).

Model
The base rule model (Figure 1) we invoke is composed of two consecutive layers that respectively mimic logical AND and OR operators (inspired by the rule learning modules in [4, 5]). The AND layer takes binary features (which are atomic boolean formulae) as input and its output is used as input by the OR layer. The output of the OR layer is mapped to the classification label 𝑦𝑖. These two layers are defined with binary weights that select the nodes which are included in the respective boolean expression (conjunction or disjunction). In other words, this network implements the evaluation of a DNF. This model has a direct equivalence with a binary classification rule like if (𝐴 ∧ 𝐵) ∨ 𝐶 then class = 1 else class = 0, where 𝐴, 𝐵 and 𝐶 are binary input features (atoms in logical terms).¹

¹ A natural extension of this architecture for sequential data would be to extend this base rule model with an explicit recursion of the base rule model, similar to an RNN. This approach was tested but faced the same limitations as classical RNNs, i.e., vanishing gradients and the capture of only short-term dependencies.

In this paper, we apply the base rule model as a 1D-convolutional window of fixed length over a sequence and retrieve all outputs as input for an additional disjunctive layer which we refer to as the Conv-OR layer. The base rule model learns a boolean expression over the window size length and the Conv-OR layer indicates where along the sequence that logical expression is valid. If the evaluation of the logical expression is valid all along the sequence, then it can be described as a global pattern; otherwise the learned pattern represents a local pattern.
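To make the evaluation semantics concrete, the following minimal sketch (our own illustration, not the paper's implementation) evaluates a base rule model with already binarised weights and slides it over a sequence together with a Conv-OR layer. Array names such as and_w, or_w and conv_or_w are illustrative placeholders, and the differentiable relaxation used during training is omitted.

```python
import numpy as np

def base_rule_model(x, and_w, or_w):
    """Evaluate a DNF over a flattened binary window x.
    x:     (window * n_features,) binary input
    and_w: (n_conjunctions, window * n_features) binary weights selecting atoms
    or_w:  (n_conjunctions,) binary weights selecting conjunctions
    """
    conj = np.all(x >= and_w, axis=1)       # AND layer: all selected atoms must be 1
    return np.any(conj & (or_w == 1))       # OR layer: any selected conjunction suffices

def conv_rule_model(seq, and_w, or_w, conv_or_w, window):
    """Slide the base rule model over a (seq_len, n_features) binary sequence
    and aggregate the window outputs with the Conv-OR weights."""
    outs = []
    for t in range(seq.shape[0] - window + 1):
        outs.append(base_rule_model(seq[t:t + window].reshape(-1), and_w, or_w))
    outs = np.array(outs)
    # Conv-OR layer: label 1 iff the expression holds at a selected window position
    return int(np.any(outs & (conv_or_w == 1)))
```

With all Conv-OR weights activated this reduces to checking the expression anywhere in the sequence (a global pattern); with only some weights activated, only the corresponding positions are checked (a local pattern).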
Figure 1: Architecture of the model with an example on a sequence of letters. The base rule model is applied as a 1D-convolutional window over the sequence. The resulting boolean values are given as input to the Conv-OR layer, which indicates through its activated weights where along the sequence the expression learned by the base model is true. The output of the Conv-OR layer is mapped to the label of the sequence 𝑦𝑖. For local patterns, the base model expression needs to be shifted according to the Conv-OR layer weights.

Expressivity
With this approach, different sequence-dependent expressions can be extracted and their nature depends on the learned weights of the Conv-OR layer (Figure 1).

• If all the weights of the Conv-OR layer are activated (i.e. equal to 1), the logical expression learned by the base model is valid over the whole sequence: a global pattern is learned.

• If only some of the weights of the Conv-OR layer are activated, the logical expression learned by the base model is valid only in the windows associated with those weights: a local pattern is learned. The base model logical expression is modified accordingly to match that shift (see the example in Figure 1 with a shift of 3 sequential steps).

The obtained weights thus translate to rules with the following grammar:

rule → if expr then class = 1 else class = 0
expr → local pattern | global pattern | sequence property

We introduce 𝑡, the position at which the last observation in a sequence was made. With 𝑡 being our reference, in a sequence of size 𝑁 ∈ ℕ, 𝑡 − 𝑖 refers to the moment of the 𝑖-th observation before 𝑡 (0 ≤ 𝑖 ≤ 𝑁 − 1). 𝐴, 𝐵, 𝐶 and 𝐷 are toy binary input features; for simplicity of the example, these features cannot be activated simultaneously at the same position 𝑡 in the sequence, but the method remains valid for complex symbolic sequences [7].

With those definitions, we list below examples of different sequence-dependent atoms that can be expressed with the proposed architecture (see Figure 1):

local pattern is a boolean expression composed of atoms that are TRUE at a specific position, for example A at t-15.

global pattern is an expression describing the presence of a pattern anywhere in the sequence, for example B-D in sequence, where the "−" sign refers to "followed by" in global patterns.

global pattern over window, for example B-*-D in window [t-6; t-3].

condition on sequence property is a condition on the sequence length, for example 4 ≤ length of sequence ≤ 6 (not shown in the figure; it corresponds to a specific case where the base model has learned an empty rule).

Those expressions could then be used as input to another AND/OR layer model, for instance, for extending the rule complexity.
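As an illustration of how learned weights translate into this grammar (a sketch under our own indexing conventions, not the paper's code), binarised Conv-OR weights can be read off into a rule string; here `atoms` stands for the textual form of the expression learned by the base model over one window.

```python
def weights_to_rule(conv_or_w, atoms, window, seq_len):
    """Turn binarised Conv-OR weights into a rule string in the grammar above."""
    active = [i for i, w in enumerate(conv_or_w) if w == 1]
    if len(active) == len(conv_or_w):
        # every window position is selected: global pattern
        return f"if {atoms} in sequence then class = 1 else class = 0"
    # local pattern: window i covers positions t-(seq_len-1-i) .. t-(seq_len-window-i)
    spans = [f"[t-{seq_len - 1 - i}; t-{seq_len - window - i}]" for i in active]
    return ("if " + " or ".join(f"{atoms} in window {s}" for s in spans)
            + " then class = 1 else class = 0")
```

For example, with atoms = "B-*-D", window = 4, seq_len = 16 (values chosen purely for illustration) and only the weight at window position 9 activated, this yields if B-*-D in window [t-6; t-3] then class = 1 else class = 0, matching the form of the example above.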
Sparsity Requirements
In order to learn those expressions, especially the global ones, the model needs to generalize without observing all possible instances at training time. The first requirement for that matter is sparsity in the base model. The approach taken follows a sparsify-during-training method [8] and dynamically enforces sparsity in the weights from 0% to 100% [9]. The model with the highest prediction accuracy on the validation dataset and the highest sparsity is kept.
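As a rough illustration of such a schedule (a simplified sketch of the general sparsify-during-training idea, not the exact procedure of [8, 9]), the target sparsity can be ramped from 0% to 100% over training while, at each step, the smallest-magnitude weights are masked out; the checkpoint with the best validation accuracy at the highest sparsity is then retained.

```python
import numpy as np

def target_sparsity(epoch, n_epochs):
    """Ramp the fraction of pruned weights linearly from 0 to 1 over training."""
    return min(1.0, epoch / n_epochs)

def prune_mask(weights, sparsity):
    """Boolean mask keeping only the largest-magnitude (1 - sparsity) fraction."""
    k = int(np.floor(sparsity * weights.size))
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return np.abs(weights) > threshold
```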
Preliminary results
Experiments with synthetic toy examples show that ground-truth patterns such as those expressed in Figure 1 can be discovered from moderately small sample sizes (see Appendix A for details). Variations and extensions of these patterns have shown promising results, and further tests need to be pursued to scope the range of discoverable patterns with this approach.

Limitations
There are limitations to this architecture, the main one being that the window size is fixed, which limits the size of the patterns that can be found. There is a trade-off between the maximum size of patterns and the training complexity that has to be further investigated.

Conclusion
To conclude, we presented a 1D-convolutional neural architecture to discover local and global patterns in sequences while learning binary classification rules. This architecture is fully differentiable and requires sparsity that is enforced dynamically. Its main limitation is its dependence on the window size parameter. Further work will consist in integrating this block into more complex architectures to augment the expressivity of the learned rules. Moreover, the algorithm will be tested on concrete datasets such as the UCI splice dataset or the E. coli promoter gene sequences dataset [10] to demonstrate its ability to discover rules with learned non-trivial patterns.

Acknowledgments
This work has been partially funded by the French government as part of project PSPC AIDA 2019-PSPC-09, in the framework of the "Programme d'Investissement d'Avenir".

References
[1] J. Han, M. Kamber, J. Pei, 13 - Data Mining Trends and Research Frontiers, in: J. Han, M. Kamber, J. Pei (Eds.), Data Mining (Third Edition), The Morgan Kaufmann Series in Data Management Systems, Morgan Kaufmann, Boston, 2012, pp. 585–631. URL: https://www.sciencedirect.com/science/article/pii/B9780123814791000137. doi:10.1016/B978-0-12-381479-1.00013-7.
[2] C. Zhou, B. Cule, B. Goethals, Pattern Based Sequence Classification, IEEE Transactions on Knowledge and Data Engineering 28 (2015) 1–1. doi:10.1109/TKDE.2015.2510010.
[3] E. Egho, D. Gay, R. Trinquart, M. Boullé, N. Voisine, F. Clérot, MiSeRe-Hadoop: A Large-Scale Robust Sequential Classification Rules Mining Framework, in: L. Bellatreche, S. Chakravarthy (Eds.), Big Data Analytics and Knowledge Discovery, volume 10440 of Lecture Notes in Computer Science, Springer International Publishing, Cham, 2017, pp. 105–119. URL: http://link.springer.com/10.1007/978-3-319-64283-3_8. doi:10.1007/978-3-319-64283-3_8.
[4] L. Qiao, W. Wang, B. Lin, Learning Accurate and Interpretable Decision Rule Sets from Neural Networks, arXiv:2103.02826 [cs] (2021). URL: http://arxiv.org/abs/2103.02826.
[5] R. Kusters, Y. Kim, M. Collery, C. d. S. Marie, S. Gupta, Differentiable Rule Induction with Learned Relational Features, arXiv:2201.06515 [cs, stat] (2022). URL: http://arxiv.org/abs/2201.06515.
[6] C. C. Aggarwal, On effective classification of strings with wavelets, in: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD '02, Association for Computing Machinery, New York, NY, USA, 2002, pp. 163–172. URL: https://doi.org/10.1145/775047.775071. doi:10.1145/775047.775071.
[7] Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, ACM SIGKDD Explorations Newsletter 12 (2010) 40–48. URL: https://dl.acm.org/doi/10.1145/1882471.1882478. doi:10.1145/1882471.1882478.
[8] T. Hoefler, D. Alistarh, T. Ben-Nun, N. Dryden, A. Peste, Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks, arXiv:2102.00554 [cs] (2021). URL: http://arxiv.org/abs/2102.00554.
[9] T. Lin, S. U. Stich, L. Barba, D. Dmitriev, M. Jaggi, Dynamic Model Pruning with Feedback, arXiv:2006.07253 [cs, stat] (2020). URL: http://arxiv.org/abs/2006.07253.
[10] D. Dua, C. Graff, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences, 2017. URL: http://archive.ics.uci.edu/ml.

A. Experiments with toy examples
The model is tested on toy synthetic datasets that are generated to fit the rules presented in Figure 3. Datasets are composed of 10000 sequences, of which 10% are used for validation and another 10% for testing, the rest being used for training (batches of 100). We used the Adam optimizer with a learning rate set to 0.1. In this setup, we obtain 100% accuracy in 8 out of 10 trainings of 200 epochs.
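For reference, a sketch of the kind of toy data described above is given below (our reconstruction; the exact generation procedure, alphabet and ground-truth pattern are assumptions made for illustration). Sequences of symbols are labelled 1 iff they contain a ground-truth global pattern, here "B followed later by D".

```python
import numpy as np

rng = np.random.default_rng(0)
SYMBOLS = np.array(list("ABCD"))

def make_sequence(length=20):
    """Draw a random symbolic sequence."""
    return rng.choice(SYMBOLS, size=length)

def label(seq):
    """Ground-truth rule: class 1 iff a B occurs somewhere before a D."""
    b_pos = np.flatnonzero(seq == "B")
    d_pos = np.flatnonzero(seq == "D")
    return int(b_pos.size > 0 and d_pos.size > 0 and d_pos.max() > b_pos.min())

sequences = [make_sequence() for _ in range(10_000)]
labels = [label(s) for s in sequences]

# 80% train / 10% validation / 10% test split, as in the appendix
n = len(sequences)
train = sequences[:int(0.8 * n)]
val = sequences[int(0.8 * n):int(0.9 * n)]
test = sequences[int(0.9 * n):]
```

Training would then proceed on one-hot encoded symbols with mini-batches of 100 and the Adam optimizer (learning rate 0.1), as described above.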