Pat-in-the-loop: Syntax-based Neural Networks with Activation Visualization and Declarative Control

Fabio Massimo Zanzotto¹ [0000-0002-7301-3596], Dario Onorati¹, Pierfrancesco Tommasino¹, Andrea Santilli¹, Leonardo Ranaldi² [0000-0001-8488-4146], and Francesca Fallucchi² [0000-0001-5502-4358]

¹ ART Group, University of Rome Tor Vergata, Rome, Italy, fabio.massimo.zanzotto@uniroma2.it
² University Guglielmo Marconi, Rome, Italy

Abstract. The dazzling success of neural networks over natural language processing systems is imposing an urgent need to control their behavior with simpler, more direct declarative rules. In this paper, we propose Pat-in-the-loop as a model to control a specific class of syntax-oriented neural networks by adding declarative rules. In Pat-in-the-loop, distributed tree encoders make it possible to exploit parse trees in neural networks, heat parse trees visualize the activation of parse trees, and parse subtrees are used as declarative rules in the neural network. A pilot study on question classification showed that declarative rules representing human knowledge can be effectively used in these neural networks.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Neural networks are obtaining dazzling successes in natural language processing (NLP). General neural networks trained on terabytes of data are replacing decades of scientific investigation by showing unprecedented performance on a variety of NLP tasks [6]. Hence, systems based on NLP and on neural networks (NLP-NN) are everywhere. As a consequence of this success, public opinion is extremely fast in spotting possibly catastrophic, unwanted behavior in deployed NLP-NN systems (see, for example, [16]). Like many learned systems [4], NLP-NN systems are exposed to biased decisions or biased production of utterances. This problem is becoming so important that extensive analyses are being performed, for example, for the tricky class of sentiment analysis systems [11].

To promptly recover from catastrophic failures, NLP-NN systems should be endowed with the possibility of modifying their behavior by using declarative languages, that is, by teaching neural networks with a deductive teaching approach. Deductive teaching is an extremely difficult task even in the human learning process [1, 15]. Active learning techniques [5] can require too many examples and may focus the attention of NLP-NN systems on irrelevant peculiarities of datasets [2].

Looking into NLP-NN systems beyond the dazzling light is becoming an active area [9, 8], since traditional neural network visualization tools are obscure when applied to NLP-NN systems. Heatmaps are powerful tools for visualizing neural networks applied to image interpretation [20]. In fact, heatmaps can visualize how neural network layers treat specific subparts of images. Yet, when applied to NLP-NN systems [13], heatmaps are extremely difficult to interpret.

[Fig. 1: The overall idea. (a) A heat parse tree. (b) Pat-in-the-loop: the overall system.]

In this paper, we propose Pat-in-the-loop as a model to include human control in specific NLP-NN systems that exploit syntactic information. The key contributions are: (1) distributed tree encoders that directly exploit parse trees in neural networks; (2) heat parse trees that visualize which parts of parse trees are responsible for the activation of specific neurons (see Figure 1a); and (3) a declarative language for controlling the behavior of neural networks.
Distributed tree encoders make it possible to produce heat parse trees, and developers can explore the activation of parse trees for specific decisions in order to derive rules for correcting system behavior. We performed a pilot study on question classification in which Pat-in-the-loop showed that human knowledge can be effectively used to control the behavior of a syntactic NLP-NN system.

2 The Model

In Pat-in-the-loop (see Figure 1b), a generic developer, whom we call Pat, can inspect the reasons why her/his neural network makes certain decisions. Pat's neural network model is based on distributed tree encoders W_dt that directly exploit parse trees in neural networks (Sec. 2.2). Pat can visualize why some decisions are taken by the network according to the parse trees of examples x by using "heat parse trees" (Sec. 2.1 and Sec. 2.3). Hence, Pat can control the behavior of the neural network with declarative rules represented as subtrees, by encoding these rules in W_H (Sec. 2.4).

2.1 Preliminary notation

Parse trees and heat parse trees are the core representations of our model. This section introduces the notation used to describe them. Parse trees T and parse subtrees τ are recursively represented as trees t = (r, [t_1, ..., t_k]), where r is the label of the root of the tree and [t_1, ..., t_k] is the list of child trees t_i. Leaves are represented as trees t = (r, []) with an empty list of children, or directly as t = r.

Heat parse trees, similarly to "heat trees" in biology [7], are heatmaps over parse trees (see Figure 1a). The underlying representation is an active tree t, that is, a tree where an activation value v_r ∈ R is associated with each node: t = (r, v_r, [t_1, ..., t_k]). Heat parse trees are graphical visualizations of active trees t where the colors and sizes of nodes r depend on their activation values v_r.

2.2 Distributed Tree Encoders for Exploiting Parse Trees in Neural Networks

Distributed tree encoders are the encoders used in Pat-in-the-loop to directly exploit parse trees in neural networks. These encoders, stemming from distributed tree kernels [18], make it possible to represent parse trees in vector spaces R^d that embed huge spaces of subtrees R^n. They may be seen as linear transformations W_dt ∈ R^{d×n} (similarly to the Johnson-Lindenstrauss transformation [10]) that embed vectors x_T ∈ R^n of the space of tree kernels into smaller vectors y_T ∈ R^d:

  y_T = W_dt x_T

Columns w_i of W_dt encode subtrees τ^(i) and are computed with an encoding function w_i = E(τ^(i)) defined as follows:

  E(\tau^{(i)}) =
  \begin{cases}
    \mathbf{r} & \text{if } \tau^{(i)} = (r, [\,]) \\
    \mathbf{r} \otimes E(\tau^{(i)}_1) \otimes \dots \otimes E(\tau^{(i)}_k) & \text{if } \tau^{(i)} = (r, [\tau^{(i)}_1, \dots, \tau^{(i)}_k])
  \end{cases}

where the operation u ⊗ v is the shuffled circular convolution, that is, a circular convolution ∗ composed with a permutation matrix Φ: u ⊗ v = u ∗ Φv, and r ∼ N(0, (1/√d) I) is drawn from a multivariate Gaussian distribution. As for tree kernels, the linear transformations W_dt and the vectors x_T ∈ R^n are never explicitly produced: distributed tree encoders are implemented as recursive functions [18].
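To make the encoder concrete, the following is a minimal NumPy sketch of the recursive encoding function E and of the shuffled circular convolution. It assumes trees are represented as (label, children) pairs; the scale of the random node vectors, the FFT-based convolution, and the function names are implementation choices of this sketch, and the full encoders additionally weight subtrees with the tree-kernel decay factor λ and sum over all subtrees of a parse tree without ever materializing W_dt [18].

```python
import numpy as np

d = 4000                                  # dimension of the distributed tree space (as in the pilot study)
rng = np.random.default_rng(42)
perm = rng.permutation(d)                 # fixed permutation Phi used by the shuffled circular convolution
node_vectors = {}                         # one random vector per node label, generated lazily

def vec(label):
    # Random node vector; the scale is an assumption chosen so that ||r|| is roughly 1
    # (the paper draws r from a multivariate Gaussian N(0, (1/sqrt(d)) I)).
    if label not in node_vectors:
        node_vectors[label] = rng.normal(scale=1.0 / np.sqrt(d), size=d)
    return node_vectors[label]

def shuffled_circular_convolution(u, v):
    # u (x) v = u * (Phi v), where * is the circular convolution, computed here via FFT.
    return np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(v[perm])))

def encode(tree):
    # Recursive encoding function E:
    #   E((r, []))            = r
    #   E((r, [t1, ..., tk])) = r (x) E(t1) (x) ... (x) E(tk)
    label, children = tree
    out = vec(label)
    for child in children:
        out = shuffled_circular_convolution(out, encode(child))
    return out

# Example: encode the subtree (SQ, [VBD, NP, VP]) discussed in Section 2.4.
print(encode(("SQ", [("VBD", []), ("NP", []), ("VP", [])]))[:5])
```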
2.3 Visualizing Activation of Parse Trees

Distributed tree encoders give the possibility of using heat parse trees to visualize the activation of parse trees in final decisions or in intermediate neuron outputs. To compute the active trees t used to produce heat parse trees, the neural network has to be sliced at the desired layer. Let NN be the sliced neural network, x = (x_T, x_r) an example, and o its output:

  o = NN(W_dt x_T, x_r)

where x_T is the vector representing the parse tree T of the example x in the space of subtrees, W_dt is the distributed tree encoder, and x_r is the rest of the features associated with x.

Our heat parse trees show the overlapped activation of the subtrees in S(T) of a specific tree T related to a specific example x in a specific network. This shows how the subtrees in S(T) contribute to the final activation o_i, that is, one dimension of o. We believe this is more convenient than an extremely large heatmap listing the subtrees in S(T) with their related values o_i.

The computation of the active trees t used to display heat parse trees is the following. The activation weight v_r of each node r represents how much the node is responsible for the activation of the overall syntactic tree with respect to the output of the given neuron o_i. The activation value v_r is computed as:

  v_r = \sum_{\tau \in S(T),\; r \in \tau} NN(W_{dt}\, \lambda^{\frac{|\tau|}{2}}\, \boldsymbol{\tau},\; x_r)

where \boldsymbol{\tau} is the one-hot vector in the subtree space indicating the subtree τ, λ is the decay factor of the tree kernel, and r ∈ τ means that r is a node of τ. With the above computation of t, the active subtrees τ for the output o_i of a specific neuron are overlapped in a single heat parse tree.

Table 1: Pat-in-the-loop's performance, discovered declarative rules, and confusion matrices on QC before and after the use of human knowledge

  f-measure     micro avg   macro avg
  BoW              0.84        0.84
  PureNN           0.93        0.91
  HumNN            0.93        0.92

  Declarative Rules
  class   rule
  ABBR    (NP (NP (DT) (JJ full) (NN)) (PP (IN)))
  ABBR    (SQ (VBZ) (NP) (VP (VB stand) (PP (IN for))))
  ABBR    (NN abbrevation)
  ABBR    (VP (VB mean))
  NUM     (WHNP (WDT What) (NNS debts))
  NUM     (NP (NP (NNP) (NNP) (POS)) (NN))

  PureNN       ABBR  ENTY  DESC  HUM  LOC  NUM
  ABBR            6     0     3    0    0    0
  ENTY            0    84     3    2    4    1
  DESC            0     5   133    0    0    0
  HUM             0     1     1   63    0    0
  LOC             0     1     1    2   76    1
  NUM             0     5     5    0    1  102

  HumNN        ABBR  ENTY  DESC  HUM  LOC  NUM
  ABBR            7     0     2    0    0    0
  ENTY            0    83     5    3    2    1
  DESC            0     3   135    0    0    0
  HUM             0     3     0   62    0    0
  LOC             0     4     1    1   74    1
  NUM             0     3     4    1    2  103

2.4 Human-in-the-loop Layer

Pat now has the important possibility of understanding why decisions are made by a specific network and, hence, s/he can define specific rules to control the behavior of the neural network. For example, the heat parse tree in Figure 1a suggests that the subtree (SQ, [VBD, NP, VP]) is the most active in generating the decision, if the heat parse tree is computed for the output of a neuron that represents a final class. If Pat aims to correct the system's behavior for a given output, s/he may select the specific subtree τ and insert E(τ) as a row of the matrix W_H that embeds the declarative rules (see Figure 1b). This specific rule is then used when retraining the neural network and should change decisions for examples of the same kind.

3 Pilot Experiment

We experimented with Pat-in-the-loop on the coarse-grained classification problem of the question classification dataset [14], which contains 5,242 training questions and 500 testing questions. The dataset is well studied; hence, it offers a very intriguing possibility to run an experiment where a human in the loop can make a difference in calibrating the overall system.

3.1 Experimental set-up

The Pat-in-the-loop system used in the experiments (see Figure 1b) has the following configuration. Distributed trees W_dt x_T are encoded in a space R^d with d = 4,000. The decay factor of the tree kernels is λ = 0.6. The module NN(W_dt x_T, x_r) is a multi-layer perceptron that combines two multi-layer perceptrons: Synt(W_dt x_T) and Sem(x_r). Synt exploits syntactic information and its output size is 1,800. Sem exploits a bag-of-word model of the input, with 300-dimensional fastText word embeddings [3] as input and an output size of 180. Synt and Sem are concatenated and feed a multi-layer perceptron with two layers of sizes 100 and 6. We used ReLU activation functions between layers; the last activation function is a softmax. All experiments were run for 20 epochs in Keras. Finally, we used the CoreNLP constituent-based parser [12] for parsing questions.

We compared three systems: BoW, which contains only the word embeddings used as a bag-of-words; PureNN, which is the system without human knowledge; and HumNN, which is the full system with Pat's declarative knowledge. We performed a 3-fold cross-validation on the training set to accumulate misclassified examples for the human learning loop. Pat inspected these examples with heat parse trees and encoded some declarative rules in W_H (see Tab. 1).
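As an illustration of this configuration, the following is a minimal Keras sketch of the NN(W_dt x_T, x_r) module. The optimizer, the loss, and the use of a single dense layer for each of the Synt and Sem blocks are assumptions of this sketch, since the paper only specifies the layer sizes, the activation functions, and the number of epochs; the declarative-rule matrix W_H is omitted here.

```python
from tensorflow import keras
from tensorflow.keras import layers

d = 4000       # dimension of the distributed tree encoding W_dt x_T
bow_dim = 300  # fastText bag-of-word embedding x_r

tree_in = keras.Input(shape=(d,), name="distributed_tree")    # W_dt x_T
bow_in = keras.Input(shape=(bow_dim,), name="bag_of_words")   # x_r

synt = layers.Dense(1800, activation="relu", name="Synt")(tree_in)  # syntactic branch, output size 1,800
sem = layers.Dense(180, activation="relu", name="Sem")(bow_in)      # bag-of-word branch, output size 180

merged = layers.Concatenate()([synt, sem])
hidden = layers.Dense(100, activation="relu")(merged)               # first of the two final layers
output = layers.Dense(6, activation="softmax")(hidden)              # 6 coarse question classes

model = keras.Model(inputs=[tree_in, bow_in], outputs=output)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit([X_tree, X_bow], y_onehot, epochs=20)  # 20 epochs, as in the pilot study
```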
3.2 Results and discussion

The results in Table 1 show the following important facts. First, distributed tree encoders positively introduce syntactic information in neural networks: the f-measure improves from 0.84 for BoW to 0.93 for PureNN (Table 1). Second, the global results of the model with human knowledge (HumNN) are similar to, and even slightly higher than, those of PureNN: the micro-average is 0.93 for both models, and the macro-average is 0.92 for HumNN versus 0.91 for PureNN. Third, Pat could change the behavior of the system where s/he wanted. Since Pat aimed to manipulate the behavior of the system in favor of the classes ABBR and NUM, s/he focused on examples where PureNN fails and coded the resulting rules in W_H (see Table 1). After training the new model HumNN, perturbed by human declarative knowledge, the results on the test set are encouraging. In fact, although the overall performance is unchanged, the target classes improved: both ABBR and NUM have one additional correctly classified example. This tiny improvement suggests that the model can positively use declarative human knowledge. Finally, heat parse trees are informative. In fact, Pat could understand why some specific cases were misclassified and could select declarative rules to change the behavior of the system.

Globally, the results of the pilot experiment confirmed our hypothesis: a human can positively manipulate the system by inducing rules from the training set.

4 Conclusions and Future Work

Along the line of understanding neural networks and trying to control their behavior beyond the use of training examples, we presented Pat-in-the-loop. Our model exploits syntactic information in neural networks by using distributed tree encoders, visualizes the activation of syntactic information with heat parse trees, and encodes declarative knowledge in the neural network. The encouraging results of a pilot study are a first "declarative pat" on neural networks applied to natural language processing, which may open a wide range of possible research directions. By leveraging recent results obtained with KERMIT [19], we aim to assess the results of Pat-in-the-loop and to envisage novel ways to include declarative control in these specific neural networks. Endowing neural networks with declarative control may help in clarifying who is giving knowledge to these systems. In this way, we could devise machine learning models that can repay their "teachers" [17].

References

1. Agrusti, G., Damiani, V., Pasquazi, D., Carta, P.: Reading mathematics at school. Inferential reasoning on the Pythagorean theorem [Leggere la matematica a scuola. Percorsi inferenziali sul teorema di Pitagora]. Cadmo 23(1), 61–85 (2015). https://doi.org/10.3280/cad2015-001007
2. Allen, G.: Machine learning: The view from statistics. In: Proceedings of the AAAS Annual Meeting (2019)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. TACL 5, 135–146 (2017), http://aclweb.org/anthology/Q17-1010
4. Courtland, R.: Bias detectives: the researchers striving to make algorithms fair. Nature 558, 357–360 (Jun 2018). https://doi.org/10.1038/d41586-018-05469-3
5. Dasgupta, S.: Analysis of a greedy active learning strategy. In: Advances in NeurIPS. MIT Press (2005), http://papers.nips.cc/paper/2636-analysis-of-a-greedy-active-learning-strategy.pdf
6. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
7. Foster, Z.S.L., Sharpton, T.J., Grünwald, N.J.: Metacoder: An R package for visualization and manipulation of community taxonomic diversity data. PLoS Computational Biology 13(2) (2017). https://doi.org/10.1371/journal.pcbi.1005404
8. Jacovi, A., Shalom, O.S., Goldberg, Y.: Understanding convolutional neural networks for text classification. pp. 56–65 (2018), http://arxiv.org/abs/1809.08037
9. Jang, K.r., Kim, S.b., Corp, N.: Interpretable word embedding contextualization. pp. 341–343 (2018)
10. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
11. Kiritchenko, S., Mohammad, S.: Examining gender and race bias in two hundred sentiment analysis systems. In: Proceedings of *SEM (2018), https://aclanthology.info/papers/S18-2005/s18-2005
12. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of ACL. pp. 423–430 (2003). https://doi.org/10.3115/1075096.1075150
13. Li, J., Chen, X., Hovy, E., Jurafsky, D.: Visualizing and understanding neural models in NLP. In: Proceedings of NAACL (2016). https://doi.org/10.18653/v1/N16-1082, http://aclweb.org/anthology/N16-1082
14. Li, X., Roth, D.: Learning question classifiers. In: Proceedings of COLING. ACL, Stroudsburg, PA, USA (2002). https://doi.org/10.3115/1072228.1072378
15. Pasquazi, D.: Capacità sensoriali e approccio intuitivo-geometrico nella preadolescenza: Un'indagine nelle scuole [Sensory abilities and the intuitive-geometric approach in preadolescence: A survey in schools]. Cadmo 2020(1), 79–96 (2020). https://doi.org/10.3280/CAD2020-001006
16. Thompson, A.: Google's sentiment analyzer thinks being gay is bad. MOTHERBOARD (Oct 2017), https://motherboard.vice.com/en_us/article/j5jmj8/google-artificial-intelligence-bias
17. Zanzotto, F.M.: Viewpoint: Human-in-the-loop artificial intelligence. J. Artif. Intell. Res. 64, 243–252 (2019). https://doi.org/10.1613/jair.1.11345
18. Zanzotto, F.M., Dell'Arciprete, L.: Distributed tree kernels. In: Proceedings of ICML (2012)
19. Zanzotto, F.M., Santilli, A., Ranaldi, L., Onorati, D., Tommasino, P., Fallucchi, F.: KERMIT: Complementing transformer architectures with encoders of explicit syntactic interpretations. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2020)
20. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV. pp. 818–833. Cham (2014)