=Paper=
{{Paper
|id=Vol-2350/paper23
|storemode=property
|title=KINN: Incorporating Expert Knowledge in Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2350/paper23.pdf
|volume=Vol-2350
|authors=Muhammad Ali Chattha,Shoaib Ahmed Siddiqui,Muhammad Imran Malik,Ludger van Elst,Andreas Dengel,Sheraz Ahmed
|dblpUrl=https://dblp.org/rec/conf/aaaiss/ChatthaSMEDA19
}}
==KINN: Incorporating Expert Knowledge in Neural Networks ==
Muhammad Ali Chattha (1,2,3), Shoaib Ahmed Siddiqui (1,2), Muhammad Imran Malik (3,4), Ludger van Elst (1), Andreas Dengel (1,2), Sheraz Ahmed (1)

1 German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany.
2 TU Kaiserslautern, Kaiserslautern, Germany.
3 School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan.
4 Deep Learning Laboratory, National Center of Artificial Intelligence, Islamabad, Pakistan.

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

===Abstract===
The ability of Artificial Neural Networks (ANNs) to learn accurate patterns from large amounts of data has spurred the interest of many researchers and industrialists alike. The promise of ANNs to automatically discover and extract useful features/patterns from data without dwelling on domain expertise, although highly appealing, comes at the cost of a high reliance on large amounts of accurately labeled data, which is often hard to acquire and formulate, especially in time-series domains like anomaly detection, natural disaster management, predictive maintenance and healthcare. As these networks completely rely on data and ignore a very important modality, i.e. the expert, they are unable to harvest any benefit from expert knowledge, which in many cases is very useful. In this paper, we try to bridge the gap between these data-driven and expert-knowledge-based systems by introducing a novel framework for incorporating expert knowledge into the network (KINN). Integrating expert knowledge into the network has three key advantages: (a) a reduction in the amount of data needed to train the model, (b) provision of a lower bound on the performance of the resulting classifier by obtaining the best of both worlds, and (c) improved convergence of the model parameters (the model converges in a smaller number of epochs). Although experts are extremely good at solving different tasks, there are some trends and patterns which are usually hidden only in the data.
Therefore, KINN employs a novel residual knowledge incorporation scheme, which can automatically determine the quality of the predictions made by the expert and rectify them accordingly by learning the trends/patterns from the data. Specifically, the method tries to use the information contained in one modality to complement the information missed by the other. We evaluated KINN on a real-world traffic flow prediction problem. KINN significantly surpassed the performance of both the expert and the base network (an LSTM in this case) when each was evaluated in isolation, highlighting its superiority for the task.

===Introduction===
Deep Neural Networks (DNNs) have revolutionized the domain of artificial intelligence by exhibiting incredible performance in applications ranging from image classification (Krizhevsky, Sutskever, and Hinton 2012), playing board games (Silver et al. 2016), and natural language processing (Conneau et al. 2017) to speech recognition (Hinton et al. 2012). The biggest highlight of these was perhaps Google DeepMind's AlphaGo system beating one of the world's best Go players, Lee Sedol, in a five-game match (Wang et al. 2016). Consequently, the idea of superseding human performance has opened a new era of research and interest in artificial intelligence. However, the success of DNNs overshadows their limitations. Arguably the most severe limitation is their high reliance on large amounts of accurately labeled data, which in many applications is not available (Sun et al. 2017). This is specifically true in domains like anomaly detection, natural disaster management and healthcare. Moreover, training a network solely on the basis of data may result in poor performance on examples that are not, or less often, seen in the data and may also lead to counter-intuitive results (Szegedy et al. 2013).

Humans tend to learn from examples specific to the problem, similar to DNNs, as well as from different sources of knowledge and experiences (Lake, Salakhutdinov, and Tenenbaum 2015). This makes it possible for humans to learn just from acquiring knowledge about the problem without even looking at the data pertaining to it. Domain experts are quite proficient in tasks belonging to their area of expertise due to their extensive knowledge and understanding of the problem, which they have acquired over time through relevant education and experience. Hence, they rely on their knowledge when dealing with problems. Due to their deep insights, expert predictions even serve as a baseline for measuring the performance of DNNs. Nonetheless, it cannot be denied that apart from knowledge, the data also contains some useful information for solving problems. This is particularly cemented by the astonishing results achieved by DNNs that solely rely on data to find and utilize hidden features contained in the data itself (Krizhevsky, Sutskever, and Hinton 2012).

Therefore, a natural step forward is to combine both of these separate streams of knowledge, i.e. knowledge extracted from the data and the expert's knowledge. As a matter of fact, supplementing DNNs with expert knowledge and predictions in order to improve their performance has been actively researched. A way of sharing knowledge among classes in the data has been considered in zero-shot learning (Rohrbach, Stark, and Schiele 2011), where semantic relatedness among classes is used to find classes related to the known ones. Although such techniques employ knowledge transfer, they are restricted solely to the data domain and the knowledge is extracted and shared from the data itself without any intervention from the expert. Similarly, expert knowledge and opinions are incorporated using the distillation technique, where an expert network produces soft predictions that the DNN tries to emulate, or in the form of posterior regularization over DNN predictions (Hinton, Vinyals, and Dean 2015). All of these techniques try to strengthen the DNN with expert knowledge.
However, cases where the expert model is unreliable or even random have not been considered. Moreover, directly trying to mimic the expert network's predictions carries an implicit assumption regarding the high quality of the predictions made by the expert. We argue that the ideal incorporation of the expert network would be one where the strengths of both networks are promoted and their weaknesses are suppressed. Hence, we introduce a step in this direction by proposing a novel framework, the Knowledge Integrated Neural Network (KINN), which aims to integrate knowledge residing in heterogeneous sources, in the form of predictions, in a residual scheme. KINN's design allows it to be flexible. KINN can successfully integrate knowledge in cases where the predictions of the expert and the DNN align, as well as in scenarios where they are completely disjoint. Finding a state-of-the-art DNN or expert model is not the aim here; rather, the aim is to devise a strategy that facilitates the integration of expert knowledge with DNNs in a way that the final network achieves the best of both worlds.

The residual scheme employed in KINN to incorporate expert knowledge inside the network has three key advantages: (a) a significant reduction in the amount of data needed to train the model, since the network has to learn a residual function instead of learning the complete input to output space projection, (b) a lower bound on the performance of KINN based on the performance of the two underlying models, achieving the best of both worlds, and (c) improved convergence of the model parameters, as learning a residual mapping makes the optimization problem significantly easier to tackle.
Moreover, since the DNN itself is data-driven, KINN is robust enough to deal with situations where the predictions made by the expert model are not reliable or even useless.

The rest of the paper is structured as follows: We first provide a brief overview of past work in the direction of expert knowledge incorporation. We then explain the proposed framework, KINN, in detail. After that, we present the evaluation results of the different experiments performed in order to demonstrate the efficacy of KINN for the task of expert knowledge incorporation. Finally, we conclude the paper.

===Related Work===
Integrating domain knowledge and expert opinion into the network is an active area of research that dates back to the early 90s. Knowledge-Based Artificial Neural Networks (KBANN) were proposed by (Towell and Shavlik 1994). KBANN uses knowledge in the form of propositional rule sets which are hierarchically structured. In addition to directly mapping inputs to outputs, the rules also state intermediate conclusions. The network is designed to have a one-to-one correspondence with the elements of the rule set, where the neurons and the corresponding weights of their connections are specified by the rules. Apart from these rule-based connections and neurons, additional neurons are also added to learn features not specified in the rule set. A similar approach has also been followed by (Tran and Garcez 2018). Although such approaches directly incorporate knowledge into the network, they also limit the network architecture by forcing it to have a strict correspondence with the rule base. As a result, this restricts the use of alternate architectures or of networks that do not directly follow the structure defined by the rule set.

(Hu et al. 2016) integrated expert knowledge using first-order logic rules which are transferred to the network parameters through iterative knowledge distillation (Hinton, Vinyals, and Dean 2015). The DNN tries to emulate the soft predictions made by the expert network, instilling expert knowledge into the network parameters. Hence, the expert network acts as a teacher to the DNN, i.e. the student network. The objective function is taken as a weighted average between imitating the soft predictions made by the teacher network and the true hard-label predictions. The teacher network is also updated at each iteration step with the goal of finding the best teacher network that fits the rule set while, at the same time, staying close to the student network. In order to achieve this goal, the KL-divergence between the probability distribution of the predictions made by the teacher network and the softmax output layer of the student network is used as the objective function to be minimized. This acts as a constraint over the model posterior. The proposed framework was evaluated on classification tasks and achieved superior results compared to other state-of-the-art models at that time. However, the framework strongly relies on the expert network for parametric optimization and does not cater for cases where the expert knowledge is not comprehensive.

Expert knowledge is incorporated for key phrase extraction by (Gollapalli, Li, and Yang 2017), who defined label-distribution rules that dictate the probability of a word being a key phrase. For example, one rule enunciates that a noun that appears in the document as well as in the title is 90% likely to be a key phrase, and thus acts as posterior regularization providing weak supervision for the classification task. Similarly, the KL-divergence between the distribution given by the rule set and the model estimates is used as the objective function for the optimization. Again, as the model utilizes knowledge to strengthen the predictions of the network, it shifts the dependency of the network from the training data to accurate expert knowledge, which might just be an educated guess in some cases. Similarly, (Xu et al. 2017) incorporated symbolic knowledge into the network by deriving a semantic loss function that acts as a bridge between the network outputs and the logical constraints. The semantic loss function is based on constraints in the form of propositional logic and the probabilities computed by the network. During training, the semantic loss is added to the normal loss of the network and thus acts as a regularization term. This ensures that symbolic knowledge plays a part in updating the parameters of the network.

(Wu et al. 2016) proposed a Knowledge Enhanced Hybrid Neural Network (KEHNN). KEHNN utilizes knowledge in conjunction with the network to cater for text matching in long texts. Here, knowledge is considered to be the global context, such as topics, tags etc., obtained from other algorithms that extract information from multiple sources and datasets. They employed the Twitter LDA model (Zhao et al. 2011) as the prior knowledge, which was considered useful in filtering out noise from long texts. A special gate, known as the knowledge gate, is added to the traditional bi-directional Gated Recurrent Units (GRU) in the model, which controls how much information from the expert knowledge flows into the network.

===KINN: The Proposed Framework===

====Problem Formalization====
Time-series forecasting is of vital significance due to its high impact, specifically in domains like supply chain (Fildes, Goodwin, and Onkal 2015), demand prediction (Pacchin et al. 2017), and fault prediction (Baptista et al. 2018).
In a typical forecasting setting, a sequence of values {x_{t-1}, x_{t-2}, ..., x_{t-p}} from the past is used to predict the value of the variable at time-step t, where p is the number of past values leveraged for a particular prediction, which we refer to as the window size. Hence, the model is a functional mapping from past observations to the future value. This parametric mapping can be written as:

\hat{x}_t = \phi([x_{t-1}, x_{t-2}, \ldots, x_{t-p}]; \mathcal{W})

where \mathcal{W} = \{W_l, b_l\}_{l=1}^{L} encapsulates the parameters of the network and \phi : \mathbb{R}^p \mapsto \mathbb{R} defines the map from the input space to the output space. The optimal parameters of the network \mathcal{W}^* are computed based on the empirical risk over the training dataset. Using MSE as the loss function, the optimization problem can be stated as:

\mathcal{W}^* = \arg\min_{\mathcal{W}} \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \left( x_t - \phi([x_{t-1}, \ldots, x_{t-p}]; \mathcal{W}) \right)^2    (1)

where \mathcal{X} denotes the set of training sequences and x \in \mathbb{R}^{p+1}. Solving this optimization problem, comprising thousands, if not millions, of parameters, requires a large amount of data in order to successfully constrain the parametric space so that a reliable solution is obtained.

Humans, on the other hand, leverage their real-world knowledge along with their past experiences in order to make predictions about the future. The aim of KINN is to inject this real-world knowledge, in the form of an expert, into the system. However, as mentioned, information from the expert may not be reliable; therefore, KINN proposes a novel residual learning framework for the incorporation of expert knowledge into the system. The residual framework conditions the prediction of the network on the expert's opinion. As a result, the network acts as a correcting entity for the values generated by the expert. This decouples our system from complete reliance on the expert knowledge.
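To make the windowed formulation of Eq. 1 concrete, the following is a minimal sketch of how a univariate series can be turned into (window, target) pairs and how the MSE objective is evaluated. The paper provides no code; the function names (make_windows, mse) and the synthetic stand-in series below are illustrative assumptions, not part of the original work.

<pre>
import numpy as np

def make_windows(series, p=3):
    """Turn a univariate series into (past p values, next value) training pairs."""
    X, y = [], []
    for t in range(p, len(series)):
        X.append(series[t - p:t])   # [x_{t-p}, ..., x_{t-1}]
        y.append(series[t])         # x_t, the value to forecast
    return np.asarray(X), np.asarray(y)

def mse(y_true, y_pred):
    """Empirical risk of Eq. 1: mean squared error over all training windows."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Synthetic stand-in for the 30-minute traffic flow series (not the real PeMS data).
series = 30 + 10 * np.sin(np.linspace(0, 20 * np.pi, 1000))
X, y = make_windows(series, p=3)
print(X.shape, y.shape)        # (997, 3) (997,)
print(mse(y, X[:, -1]))        # error of a naive "repeat the last value" forecast
</pre>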
====Dataset====
We evaluated KINN on Caltrans Performance Measurement System (PeMS) data. The data contains records of sensor readings that measure the flow of vehicular traffic on California highways. Since the complete PeMS dataset is enormous in size, comprising records from multiple highways, we only considered a small fraction of it for our experiments, i.e. the traffic flow on Richards Ave from January 2016 till March 2016 (http://www.stat.ucdavis.edu/~clarkf/). The dataset contains information regarding the number of vehicles passing on the avenue every 30 seconds. PeMS also contains other details regarding the vehicles; however, we only consider the problem of average traffic flow forecasting in this paper. The data is grouped into 30-minute windows. The goal is to predict the average number of vehicles per 30 seconds for the next 30 minutes. Fig. 1 provides an overview of the grouped dataset. The data clearly exhibits a seasonal component along with high variance at the peaks.

Figure 1: Traffic flow data grouped into 30 minute windows

====Baseline Expert and Deep Models====
LSTMs have achieved state-of-the-art performance in a range of different domains comprising sequential data, such as language translation (Weiss et al. 2017), and handwriting and speech recognition (Zhang et al. 2018; Chiu et al. 2018). Since we are dealing with sequential data, an LSTM was a natural choice as our baseline neural network model. Although the aim of this work is to develop a technique capable of fusing useful information contained in two different modalities, irrespective of their details, we still spent significant compute time to discover the optimal network hyperparameters. This was done through a grid search confined to a reasonable hyperparameter search space. The hyperparameter search space included the number of layers in the network, the number of neurons in each layer, the activation function for each layer, along with the window size p.

Partial auto-correlation of the series was also analyzed to identify the association of the current value in the time-series with its lagged versions, as shown in Fig. 2. As evident from the figure, the series showed strong correlation with its past three values. This is also cemented by the result of the grid search, which chose a window size of three. The final network consisted of three hidden LSTM layers followed by a dense regression layer. Apart from the first layer, which used sigmoid, the Rectified Linear Unit (ReLU) (Glorot, Bordes, and Bengio 2011) was employed as the activation function. Fig. 3 shows the resulting network architecture. The data is segregated into train, validation and test sets using a 70/10/20 ratio. MSE was employed as the corresponding loss function to be optimized. The network was trained for 600 epochs and the parameters producing the best validation score were used for generating predictions on the test set.

Figure 2: Partial auto-correlation of time-series

Figure 3: Neural network architecture

Auto-Regressive Integrated Moving Average (ARIMA) is widely used by experts in time-series modelling and analysis. Therefore, we employed ARIMA as the expert opinion in our experiments. Since the data demonstrated a significant seasonal component, the seasonal variant of ARIMA (SARIMA) was used, whose parameters were estimated using the Box-Jenkins approach (Box et al. 2015). Fig. 4 demonstrates the predictions obtained by employing the LSTM model as well as the expert (SARIMA) model on the test set.

The overall predictions made by both the LSTM as well as the expert network seem plausible, as shown in Fig. 4(a). However, it is only through thorough inspection and investigation on a narrower scale that the strengths and weaknesses of each of the networks are unveiled, as shown in Fig. 4(b). The LSTM tends to capture the overall trend of the data but suffers when predicting small variations in the time-series. SARIMA, on the other hand, was more accurate in predicting variations in the time-series. In terms of MSE, the LSTM model performed considerably worse when compared to the expert model. For this dataset, the discovered LSTM model achieved a MSE of 5.90 compared to the 1.24 achieved by SARIMA on the test set.

Figure 4: Predictions of NN and Expert Network; (a) predictions over the whole test set, (b) predictions over the first 100 steps
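The paper does not name an implementation for either baseline, so the sketch below uses statsmodels' SARIMAX as a stand-in for the Box-Jenkins-fitted SARIMA expert and a small Keras LSTM as a stand-in for the grid-searched network. The seasonal order (1, 1, 1, 48), the unit count of 32, and the one-shot forecast of the test horizon are assumptions for illustration only, not the authors' tuned configuration.

<pre>
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

def fit_expert(train_series, steps):
    """Fit a seasonal ARIMA 'expert' and forecast `steps` values beyond the training data.
    The seasonal period 48 assumes 48 half-hour bins per day (an assumption)."""
    model = SARIMAX(train_series, order=(1, 1, 1), seasonal_order=(1, 1, 1, 48))
    fitted = model.fit(disp=False)
    return np.asarray(fitted.forecast(steps=steps))

def build_baseline_lstm(p=3):
    """Three hidden LSTM layers and a dense regression head, mirroring the text:
    sigmoid on the first layer, ReLU elsewhere. The unit count (32) is a guess."""
    model = Sequential([
        LSTM(32, activation="sigmoid", return_sequences=True, input_shape=(p, 1)),
        LSTM(32, activation="relu", return_sequences=True),
        LSTM(32, activation="relu"),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
</pre>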
====KINN: Knowledge Integrated Neural Network====
Most of the work in the literature (Hu et al. 2016; Gollapalli, Li, and Yang 2017) on incorporating expert knowledge into a neural network focuses on training the network by forcing it to mimic the predictions made by the expert network, ergo updating the weights of the network based on the expert's information. However, these approaches do not cater for a scenario where the expert network does not contain information about all possible scenarios. Moreover, these hybrid knowledge-based network approaches are commonly applied to the classification scenario, where the output vector of the network corresponds to a probability distribution. This allows the KL-divergence to be used as the objective function to be minimized in order to match the predictions of the network and the expert network. In the case of time-series forecasting, the output of the network is a scalar value instead of a distribution, which handicaps most of the prior frameworks proposed in the literature.

The KINN framework promotes both the expert model as well as the network to complement each other rather than directly mimicking the expert's output. This allows KINN to successfully tackle cases where the predictions from the expert are not reliable. Finding the best expert or neural network is not the focus here; instead, the focus is to incorporate the expert prediction, may it be flawed, in such a way that the neural network maintains its strengths while incorporating the strengths of the expert network.

There are many different ways through which knowledge between an expert and the network can be integrated. Let \hat{x}^p_t \in \mathbb{R} be the prediction made by the expert. We incorporate the knowledge from the expert in a residual scheme inspired by the idea of ResNet curated by (He et al. 2016). Let \phi : \mathbb{R}^{p+1} \mapsto \mathbb{R} define the mapping from the input space to the output space. The learning problem from Eq. 1, after the availability of the expert information, can now be written as:

\hat{x}_t = \phi([x_{t-1}, x_{t-2}, \ldots, x_{t-p}, \hat{x}^p_t]; \mathcal{W}) + \hat{x}^p_t

\mathcal{W}^* = \arg\min_{\mathcal{W}} \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \left( x_t - \left( \phi([x_{t-1}, \ldots, x_{t-p}, \hat{x}^p_t]; \mathcal{W}) + \hat{x}^p_t \right) \right)^2    (2)

Instead of computing a full input space to output space transform as in Eq. 1, the network instead learns a residual function. This residual function can be considered as a correction term to the prediction made by the expert model. Since the model is learning a correction term for the expert's prediction, it is essential for the model prediction to be conditioned on the expert's prediction, as indicated in Eq. 2. There are two simple ways to achieve this conditioning for the LSTM network. The first one is to append the prediction at the end of the input sequence, as indicated in the equation. Another possibility is to stack a new channel onto the input with repeated values of the expert's prediction. The second option makes the optimization problem easier, as the network has direct access to the expert's prediction at every time-step, and therefore results in minor improvements in terms of MSE. The system architecture of KINN is shown in Fig. 5.

Incorporating expert knowledge in this residual fashion serves a very important purpose in our case. In cases where the expert's predictions are inaccurate, the network can generate large offsets in order to compensate for the error, while the network can essentially output zero in cases where the expert's predictions are extremely accurate. With this flexibility built into the system, the system can itself decide its reliance on the expert's predictions.

Figure 5: Proposed Architecture
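A minimal sketch of the residual scheme of Eq. 2, assuming a Keras implementation: the expert prediction is stacked as a second input channel (the conditioning variant the paper reports as slightly easier to optimize) and added back onto the network output through a skip connection, so the LSTM only has to learn a correction term. Layer sizes, optimizer and the helper names (build_kinn, stack_expert_channel) are illustrative assumptions.

<pre>
import numpy as np
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import LSTM, Dense, Add

def stack_expert_channel(X, expert_pred):
    """Repeat the expert's scalar prediction along the window so the network
    sees it at every time-step (the 'extra channel' conditioning variant)."""
    expert_channel = np.repeat(np.asarray(expert_pred).reshape(-1, 1), X.shape[1], axis=1)
    return np.stack([X, expert_channel], axis=-1)            # shape (N, p, 2)

def build_kinn(p=3):
    window = Input(shape=(p, 2), name="window_plus_expert")  # observations + expert channel
    expert = Input(shape=(1,), name="expert_prediction")     # \hat{x}^p_t, added back below

    h = LSTM(32, activation="sigmoid", return_sequences=True)(window)
    h = LSTM(32, activation="relu")(h)
    correction = Dense(1, name="residual_correction")(h)     # phi(...) in Eq. 2

    # \hat{x}_t = phi([x_{t-1}, ..., x_{t-p}, \hat{x}^p_t]; W) + \hat{x}^p_t
    output = Add(name="kinn_output")([correction, expert])
    model = Model(inputs=[window, expert], outputs=output)
    model.compile(optimizer="adam", loss="mse")               # objective of Eq. 2
    return model

# Hypothetical usage, reusing X_train / y_train / expert_train from the earlier sketches:
# kinn = build_kinn(p=3)
# kinn.fit([stack_expert_channel(X_train, expert_train), expert_train.reshape(-1, 1)],
#          y_train, epochs=150)
</pre>

The Add skip connection mirrors a ResNet-style residual: when the expert is already accurate, the dense head can drive the correction toward zero, and when the expert is poor, the network is free to output large offsets.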
===Evaluation===
We curated a range of different experiments, each employing KINN in a unique scenario, in order to evaluate its performance under varied conditions. We compare KINN's results with the expert as well as the DNN in terms of performance to highlight the gains achieved by employing the residual learning scheme. To ensure a fair comparison, all of the pre-processing and LSTM hyperparameters were kept the same when the model was tested in isolation and when integrated as the residual function in KINN.

In the first setting, we tested and compared KINN's performance in the normal case, where the expert predictions are accurate and the LSTM is trained on the complete available training set. We present the results from this normal case in experiment # 01. In order to evaluate KINN's performance in cases where the amount of training data available is small or the expert is inaccurate, we established two different sets of experiments starting from the configuration employed in the first experiment. In the first case, we reduced the amount of training data provided to the models for training. We present the findings from this experiment in experiment # 02. In the second case, we reduced the reliability of the expert predictions by injecting random noise. The results from this experiment are summarized in experiment # 03. A direct extension of the last two experiments is to evaluate KINN's performance in cases where both of these conditions hold, i.e. the amount of training data is reduced and the expert is noisy. We summarize the results for this experiment in experiment # 04. Finally, we evaluated KINN's performance in cases where the expert contains no information. We achieved this in two different ways. We first evaluated the case where the expert always predicts the value of zero. In this case, the target was to evaluate the impact (if any) of introducing the residual learning scheme, since the amount of information presented to the LSTM network was exactly the same as for the isolated LSTM model in the first experiment. We then tested a more realistic scenario, where the expert model replicated the values from the last time-step of the series. We elaborate the findings from this experiment (for both settings) in experiment # 05.

Table 1: MSE on the test set for the experiments performed
Experiment | Description | % of training data used | DNN | Expert Network | KINN
1 | Full training set and accurate expert | 100 | 5.90 | 1.24 | 0.74
2 | Reduced training set (50%) and accurate expert | 50 | 6.36 | 1.52 | 0.89
2 | Reduced training set (10%) and accurate expert | 10 | 6.68 | 2.67 | 1.53
3 | Full training set and noisy expert | 100 | 5.90 | 7.81 | 3.09
4 | Reduced training set and noisy expert | 10 | 6.68 | 7.81 | 3.73
5 | Full training set and zero expert pred. | 100 | 5.90 | 621.00 | 5.92
5 | Full training set and delayed expert pred. | 100 | 5.90 | 9.04 | 5.91

====Experiment # 01: Full training set and accurate expert====
We first tested both the LSTM as well as the expert model in isolation in order to precisely capture the impact of introducing the residual learning scheme. KINN demonstrated significant improvements in training dynamics right from the start. KINN converged faster than the isolated LSTM. As opposed to the isolated LSTM, which required more training time (epochs) to converge, KINN typically converged in only one fourth of the epochs taken by the isolated LSTM, which is a significant improvement in terms of compute time. Apart from the compute time, KINN achieved a MSE of 0.74 on the test set. This is a very significant improvement in comparison to the isolated LSTM model, which had a MSE of 5.90. Even compared to the expert model, KINN demonstrated a relative improvement of 40% in terms of MSE. Fig. 6 showcases the predictions made by KINN along with the isolated LSTM and the expert network on the test set. It is evident from the figure that KINN caters for the weaknesses of each of the two models involved using the information contained in the other.
The resulting predictions are more accurate than those of the expert network at the minima and also capture the small variations in the series which were missed by the LSTM network.

In order to further evaluate the results, the error at each time-step is compared for the isolated models along with KINN. To aid the visualization, the step-wise error for the first 100 time-steps of the test set is shown in Fig. 6. The plot shows that the step-wise prediction error of KINN is lower than that of both the expert model and the LSTM for the major portion of the time.

However, there are instances where the predictions made by KINN are slightly worse than those of the baseline models. In particular, the prediction error of KINN exceeded the error of the expert network for only 30% of the time-steps, and only 22% of the time-steps in the case of the LSTM network. Nevertheless, even in those instances, the performance of KINN was still on par with the other models, since on 99% of the time-steps the difference in error is less than 1.5.

Figure 6: Predictions and the corresponding error plot for the normal case (experiment # 01); (a) predictions of all models, (b) step-wise error plot

====Experiment # 02: Reduced training set and accurate expert====
One of the objectives of KINN was to reduce the dependency of the network on large amounts of labelled data. We argue that the proposed model not only utilizes expert knowledge to cater for the shortcomings of the network, but also helps in significantly reducing its dependency on the data. To further evaluate this claim, a series of experiments was performed. KINN was trained again from scratch using only 50% of the data in the training set. The test set remained unchanged. Similarly, the LSTM network was also trained with the same 50% subset of the training set.

The LSTM network trained on the 50% subset of the training data attained a MSE of 6.36, which is slightly worse than the MSE of the network trained on the whole training set. Minor degradation was also observed in the performance of the expert network, which achieved a MSE of 1.52. Despite this reduction in the dataset size, KINN achieved significantly better results compared to both the LSTM as well as the expert model, achieving a MSE of 0.89. Fig. 7 visualizes the corresponding prediction and error plots of the models trained on the 50% subset of the training data.

Figure 7: Prediction and error plot with only 50% of the training data being utilized; (a) predictions of all models, (b) step-wise error plot

We performed the same experiment again with a very drastic reduction in the training dataset size by using only a 10% subset of the training data. Fig. 8 visualizes the results from this experiment in the same way, by first plotting the predictions from the models along with the error plot. It is interesting to note that, since the LSTM performed considerably poorly due to the extremely small training set size, the network shifted its focus to the predictions of the expert network and made only minor corrections to them, as evident from Fig. 8(a). This highlights KINN's ability to decide its reliance on the expert predictions based on the quality of the information. In terms of MSE, the LSTM model performed the worst. When trained on only the 10% subset of the training set, the LSTM model attained a MSE of 6.68, whereas the expert model achieved a MSE of 2.67. KINN, on the other hand, still outperformed both of these models and achieved a MSE of 1.53.

Figure 8: Prediction and error plot with only 10% of the training data being utilized; (a) predictions of all models, (b) step-wise error plot

====Experiment # 03: Full training set and noisy expert====
In all of the previous experiments, the expert model was relatively better than the LSTM model employed in our experiments. The obtained results highlight KINN's ability to capitalize on the information obtained from the expert model to achieve significant improvements in its predictions. KINN also demonstrated remarkable generalization despite the drastic reduction in the amount of training data, highlighting KINN's ability to achieve accurate predictions in low data regimes. However, in conjunction with reducing the dependency of the network on data, it is also imperative that the network does not become too dependent on the expert knowledge, making it essential for that knowledge to be accurate/perfect. This is usually not catered for in most of the prior work. We believe that the proposed residual scheme enables the network to handle erroneous expert knowledge efficiently by allowing it to be smart enough to realize weaknesses in the expert network and adjust accordingly. In order to verify KINN's ability to adjust to poor predictions from the expert, we performed another experiment where random noise was injected into the predictions from the expert network. This random noise degraded the reliability of the expert predictions. To achieve this, random noise within one standard deviation of the average traffic flow was added to the expert predictions. As a result, the resulting expert predictions attained a MSE of 7.81, which is considerably poor compared to that of the LSTM (5.90). We then trained KINN using these noisy expert predictions. Fig. 9 visualizes the corresponding prediction and error plots.
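As a sketch of the corruption step used in experiment # 03: the paper only states that the noise is bounded by one standard deviation of the traffic flow, so the uniform distribution, the helper name add_expert_noise and the fixed seed below are assumptions for reproducibility of the illustration.

<pre>
import numpy as np

def add_expert_noise(expert_pred, series, seed=0):
    """Degrade the expert by adding random noise bounded by one standard
    deviation of the traffic flow series (experiment # 03 setup)."""
    rng = np.random.default_rng(seed)
    scale = float(np.std(series))
    noise = rng.uniform(-scale, scale, size=np.shape(expert_pred))
    return np.asarray(expert_pred) + noise
</pre>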
As evident from Fig. 9(a), KINN still outperformed both the expert as well as the LSTM with a MSE of 3.09. Despite the fact that neither the LSTM nor the expert model was accurate, KINN still managed to squeeze out useful information from both modalities to construct an accurate predictor. This demonstrates the true strength of KINN, as it not only reduces the dependency of the network on the data but also adapts itself in the case of poorly made expert opinions. KINN achieved a significant reduction of 48% in the MSE of the LSTM network by incorporating the noisy expert prediction in the residual learning framework.

Figure 9: Prediction and error plot with inaccurate expert prediction; (a) predictions of all models, (b) step-wise error plot

====Experiment # 04: Reduced training set and noisy expert====
As a natural follow-up to the last two experiments, we introduced both conditions at the same time, i.e. reduced training set size and noisy predictions from the expert. The training set was again reduced to a 10% subset of the training data for training the model, while keeping the test set intact. Fig. 10 demonstrates that despite this worst-case condition, KINN still managed to outperform both the LSTM as well as the noisy expert predictions.

Figure 10: Prediction and error plot with inaccurate expert prediction and with only 10% data; (a) predictions of all models, (b) step-wise error plot

====Experiment # 05: Full training set and poor expert====
As the final experiment, we evaluated KINN's performance in cases where the expert predictions are not useful at all. We achieved this via two different settings. In the first setting, we considered an expert network that predicts zero every time. In the second setting, the expert network was made to lag by a step of one, resulting in a mismatch of one time-step with the predictions. Putting zero in place of \hat{x}^p_t in Eq. 2 yields:

\hat{x}_t = \phi([x_{t-1}, x_{t-2}, \ldots, x_{t-p}, 0]; \mathcal{W}) + 0

\mathcal{W}^* = \arg\min_{\mathcal{W}} \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \left( x_t - \left( \phi([x_{t-1}, \ldots, x_{t-p}, 0]; \mathcal{W}) + 0 \right) \right)^2

\mathcal{W}^* = \arg\min_{\mathcal{W}} \frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \left( x_t - \phi([x_{t-1}, \ldots, x_{t-p}, 0]; \mathcal{W}) \right)^2

This is almost equivalent to the normal, unconditioned full input to output space projection learning case (Eq. 1), except for a zero in the conditioning vector. In the case of lagged predictions by the expert network, since we stack the expert prediction \hat{x}^p_t in a separate channel, the network assigns a negligible weight to this channel, resulting in exactly the same performance as in the normal case.

Table 1 provides the details regarding the results obtained for this experiment. It is clear from the table that in cases where the expert network either gave zero as its prediction or gave lagged predictions, both of which are useless, the network's performance was identical to the normal case, since the network learned to ignore the output from the expert. These results highlight that KINN provides a lower bound on the performance based on the performance of the two involved entities: the expert model and the network.
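For completeness, a small sketch of how the two degenerate "experts" of experiment # 05 can be constructed, assuming the windowing helpers from the earlier sketches; the function names are ours and purely illustrative.

<pre>
import numpy as np

def zero_expert(n_predictions):
    """An expert that always predicts zero."""
    return np.zeros(n_predictions)

def lagged_expert(series, p=3):
    """An expert that simply repeats the previous observation, i.e. its
    prediction for x_t is x_{t-1}, one time-step out of sync with the targets."""
    series = np.asarray(series)
    return series[p - 1:-1]   # aligned with the targets y = series[p:]
</pre>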
===Discussion===
These thorough experiments advocate that the underlying residual mapping function learned by KINN is successful in combining the network with the prediction made by the expert. Specifically, KINN demonstrated the ability to recognize the quality of the predictions made by both of the base networks and shifted its reliance accordingly. In all of the experiments that we conducted, the MSE of the predictions made by KINN never exceeded (disregarding insignificant changes) the MSE of the predictions achieved by the better of the LSTM and the expert model, except in the case of completely useless expert predictions, where it performed on par with the LSTM network. Table 1 provides a summary of the results obtained from all the different experiments performed. It is interesting to note that even with a huge reduction in the size of the training set, the MSE does not drastically increase as one would expect. This is due to the strong seasonal component present in the dataset. As a result, even with only a 10% subset of the training data, the algorithms were able to learn the general pattern exhibited by the sequence. It is only in estimating small variations that these networks faced difficulty when trained on less data.
===Conclusion===
We propose a new architecture for incorporating expert knowledge into a deep network. It incorporates this expert knowledge in a residual scheme where the network learns a correction term for the predictions made by the expert. The knowledge incorporation scheme introduced by KINN has three key advantages. The first is the relaxation of the requirement for a huge dataset to train the model. The second is the provision of a lower bound on the performance of the resulting classifier, since KINN achieves the best of both worlds by combining the two different modalities. The third is its robustness in catering for poor/noisy predictions made by the expert. Through extensive evaluation, we demonstrated that the underlying residual function learned by the network makes the system robust enough to deal with imprecise expert information, even in cases where there is a dearth of labelled data. This is because the network does not try to imitate the predictions made by the expert network, but instead extracts and combines the useful information contained in both of the domains.

===Acknowledgements===
This work is partially supported by the Higher Education Commission (Pakistan), Continental Automotive GmbH, and the BMBF project DeFuseNN (Grant 01IW17002).

===References===
Baptista, M.; Sankararaman, S.; de Medeiros, I. P.; Nascimento Jr, C.; Prendinger, H.; and Henriques, E. M. 2018. Forecasting fault events for predictive maintenance using data-driven techniques and ARMA modeling. Computers & Industrial Engineering 115:41–53.

Box, G. E.; Jenkins, G. M.; Reinsel, G. C.; and Ljung, G. M. 2015. Time series analysis: forecasting and control. John Wiley & Sons.

Chiu, C.-C.; Sainath, T. N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R. J.; Rao, K.; Gonina, E.; et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4774–4778. IEEE.

Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Fildes, R.; Goodwin, P.; and Onkal, D. 2015. Information use in supply chain forecasting.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, 315–323.

Gollapalli, S. D.; Li, X.-L.; and Yang, P. 2017. Incorporating expert knowledge into keyphrase extraction. In AAAI, 3180–3187.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, 1097–1105.

Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.

Pacchin, E.; Gagliardi, F.; Alvisi, S.; Franchini, M.; et al. 2017. A comparison of short-term water demand forecasting models. In CCWI2017, 24–24. The University of Sheffield.

Rohrbach, M.; Stark, M.; and Schiele, B. 2011. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 1641–1648. IEEE.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484.

Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, 843–852. IEEE.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Towell, G. G., and Shavlik, J. W. 1994. Knowledge-based artificial neural networks. Artificial Intelligence 70(1-2):119–165.

Tran, S. N., and Garcez, A. S. d. 2018. Deep logic networks: Inserting and extracting knowledge from deep belief networks. IEEE Transactions on Neural Networks and Learning Systems 29(2):246–258.

Wang, F.-Y.; Zhang, J. J.; Zheng, X.; Wang, X.; Yuan, Y.; Dai, X.; Zhang, J.; and Yang, L. 2016. Where does AlphaGo go: From Church-Turing thesis to AlphaGo thesis and beyond. IEEE/CAA Journal of Automatica Sinica 3(2):113–120.

Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

Wu, Y.; Wu, W.; Li, Z.; and Zhou, M. 2016. Knowledge enhanced hybrid neural network for text matching. arXiv preprint arXiv:1611.04684.

Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; and Broeck, G. V. d. 2017. A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157.

Zhang, X.-Y.; Yin, F.; Zhang, Y.-M.; Liu, C.-L.; and Bengio, Y. 2018. Drawing and recognizing Chinese characters with recurrent neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):849–862.

Zhao, W. X.; Jiang, J.; Weng, J.; He, J.; Lim, E.-P.; Yan, H.; and Li, X. 2011. Comparing twitter and traditional media using topic models. In European conference on information retrieval, 338–349. Springer.