KINN: Incorporating Expert Knowledge in Neural Networks

Muhammad Ali Chattha (1,2,3), Shoaib Ahmed Siddiqui (1,2), Muhammad Imran Malik (3,4), Ludger van Elst (1), Andreas Dengel (1,2), Sheraz Ahmed (1)

(1) German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany.
(2) TU Kaiserslautern, Kaiserslautern, Germany.
(3) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan.
(4) Deep Learning Laboratory, National Center of Artificial Intelligence, Islamabad, Pakistan.

Abstract

The ability of Artificial Neural Networks (ANNs) to learn accurate patterns from large amounts of data has spurred the interest of many researchers and industrialists alike. The promise of ANNs to automatically discover and extract useful features/patterns from data without dwelling on domain expertise seems highly appealing, but it comes at the cost of a high reliance on large amounts of accurately labeled data, which are often hard to acquire and formulate, especially in time-series domains like anomaly detection, natural disaster management, predictive maintenance and healthcare. As these networks rely completely on data and ignore a very important modality, i.e. the expert, they are unable to harvest any benefit from expert knowledge, which in many cases is very useful. In this paper, we try to bridge the gap between these data-driven and expert-knowledge-based systems by introducing a novel framework for incorporating expert knowledge into the network (KINN). Integrating expert knowledge into the network has three key advantages: (a) a reduction in the amount of data needed to train the model, (b) the provision of a lower bound on the performance of the resulting classifier by obtaining the best of both worlds, and (c) improved convergence of the model parameters (the model converges in a smaller number of epochs). Although experts are extremely good at solving different tasks, there are some trends and patterns which are usually hidden only in the data. Therefore, KINN employs a novel residual knowledge incorporation scheme, which can automatically determine the quality of the predictions made by the expert and rectify them accordingly by learning the trends/patterns from the data. Specifically, the method tries to use the information contained in one modality to complement the information missed by the other. We evaluated KINN on a real-world traffic flow prediction problem. KINN significantly superseded the performance of both the expert as well as the base network (an LSTM in this case) when evaluated in isolation, highlighting its superiority for the task.

Copyright held by the author(s). In A. Martin, K. Hinkelmann, A. Gerber, D. Lenat, F. van Harmelen, P. Clark (Eds.), Proceedings of the AAAI 2019 Spring Symposium on Combining Machine Learning with Knowledge Engineering (AAAI-MAKE 2019). Stanford University, Palo Alto, California, USA, March 25-27, 2019.

Deep Neural Networks (DNNs) have revolutionized the domain of artificial intelligence by exhibiting incredible performance in applications ranging from image classification (Krizhevsky, Sutskever, and Hinton 2012) and playing board games (Silver et al. 2016) to natural language processing (Conneau et al. 2017) and speech recognition (Hinton et al. 2012). Perhaps the biggest highlight was Google DeepMind's AlphaGo system beating one of the world's best Go players, Lee Sedol, in a five-game match (Wang et al. 2016). Consequently, the idea of superseding human performance has opened a new era of research and interest in artificial intelligence. However, the success of DNNs overshadows their limitations. Arguably the most severe limitation is their high reliance on large amounts of accurately labeled data, which in many applications are not available (Sun et al. 2017). This is specifically true in domains like anomaly detection, natural disaster management and healthcare. Moreover, training a network solely on the basis of data may result in poor performance on examples that are rarely or never seen in the data and may also lead to counter-intuitive results (Szegedy et al. 2013).

Humans tend to learn from examples specific to the problem, similar to DNNs, but also from different sources of knowledge and experiences (Lake, Salakhutdinov, and Tenenbaum 2015). This makes it possible for humans to learn just from acquiring knowledge about the problem without even looking at the data pertaining to it. Domain experts are quite proficient in tasks belonging to their area of expertise due to their extensive knowledge and understanding of the problem, which they have acquired over time through relevant education and experience. Hence, they rely on their knowledge when dealing with problems. Due to their deep insights, expert predictions even serve as a baseline for measuring the performance of DNNs. Nonetheless, it cannot be denied that apart from knowledge, the data also contains useful information for solving problems. This is particularly cemented by the astonishing results achieved by DNNs that rely solely on data to find and utilize hidden features contained in the data itself (Krizhevsky, Sutskever, and Hinton 2012).
Therefore, a natural step forward is to combine both these separate streams of knowledge, i.e. the knowledge extracted from the data and the expert's knowledge. As a matter of fact, supplementing DNNs with expert knowledge and predictions in order to improve their performance has been actively researched. A way of sharing knowledge among classes in the data has been considered in zero-shot learning (Rohrbach, Stark, and Schiele 2011), where semantic relatedness among classes is used to find classes related to the known ones. Although such techniques employ knowledge transfer, they are restricted solely to the data domain, and the knowledge is extracted and shared from the data itself without any intervention from the expert. Similarly, expert knowledge and opinions have been incorporated using a distillation technique, where the expert network produces soft predictions that the DNN tries to emulate, or in the form of posterior regularization over the DNN predictions (Hinton, Vinyals, and Dean 2015). All of these techniques try to strengthen the DNN with expert knowledge. However, cases where the expert model is unreliable or even random have not been considered. Moreover, directly trying to mimic expert network predictions carries an implicit assumption regarding the high quality of the predictions made by the expert. We argue that the ideal incorporation of an expert network would be one where the strengths of both networks are promoted and their weaknesses are suppressed. Hence, we take a step in this direction by proposing a novel framework, the Knowledge Integrated Neural Network (KINN), which aims to integrate knowledge residing in heterogeneous sources, in the form of predictions, in a residual scheme. KINN's design allows it to be flexible: KINN can successfully integrate knowledge in cases where the predictions of the expert and the DNN align, as well as in scenarios where they are completely disjoint. Finding a state-of-the-art DNN or expert model is not the aim here; rather, the aim is to devise a strategy that facilitates the integration of expert knowledge with DNNs in a way that the final network achieves the best of both worlds.

The residual scheme employed in KINN to incorporate expert knowledge inside the network has three key advantages: (a) a significant reduction in the amount of data needed to train the model, since the network has to learn a residual function instead of learning the complete input-to-output space projection, (b) a lower bound on the performance of KINN based on the performance of the two constituent models, achieving the best of both worlds, and (c) improvements in the convergence of the model parameters, as learning a residual mapping makes the optimization problem significantly easier to tackle. Moreover, since the DNN itself is data driven, this makes KINN robust enough to deal with situations where the predictions made by the expert model are not reliable or even useless.

The rest of the paper is structured as follows: We first provide a brief overview of past work in the direction of expert knowledge incorporation. We then explain the proposed framework, KINN, in detail. After that, we present the evaluation results regarding the different experiments performed in order to demonstrate the efficacy of KINN for the task of expert knowledge incorporation. Finally, we conclude the paper.

Related Work

Integrating domain knowledge and expert opinion into the network is an active area of research and dates back to the early 90s. Knowledge-Based Artificial Neural Networks (KBANN) were proposed by (Towell and Shavlik 1994). KBANN uses knowledge in the form of propositional rule sets which are hierarchically structured. In addition to directly mapping inputs to outputs, the rules also state intermediate conclusions. The network is designed to have a one-to-one correspondence with the elements of the rule set, where neurons and the corresponding weights of their connections are specified by the rules. Apart from these rule-based connections and neurons, additional neurons are also added to learn features not specified in the rule set. A similar approach has also been followed by (Tran and Garcez 2018). Although such approaches directly incorporate knowledge into the network, they also limit the network architecture by forcing it to have a strict correspondence with the rule base. As a result, this restricts the use of alternate architectures or of networks that do not directly follow the structure defined by the rule set.

(Hu et al. 2016) integrated expert knowledge using first-order logic rules, which are transferred to the network parameters through iterative knowledge distillation (Hinton, Vinyals, and Dean 2015). The DNN tries to emulate the soft predictions made by the expert network, instilling expert knowledge into the network parameters. Hence, the expert network acts as a teacher to the DNN, i.e. the student network. The objective function is taken as a weighted average between imitating the soft predictions made by the teacher network and the true hard-label predictions. The teacher network is also updated at each iteration step with the goal of finding the best teacher network that fits the rule set while, at the same time, staying close to the student network. In order to achieve this goal, the KL-divergence between the probability distribution of the predictions made by the teacher network and the softmax output layer of the student network is used as the objective function to be minimized. This acts as a constraint over the model posterior. The proposed framework was evaluated on classification tasks and achieved superior results compared to other state-of-the-art models at that time. However, the framework strongly relies on the expert network for parametric optimization and does not cater for cases where the expert knowledge is not comprehensive.

Expert knowledge was incorporated for key phrase extraction by (Gollapalli, Li, and Yang 2017), who defined label-distribution rules that dictate the probability of a word being a key phrase. For example, a rule enunciates that a noun appearing both in the document and in the title is 90% likely to be a key phrase, and thus acts as posterior regularization providing weak supervision for the classification task. Similarly, the KL-divergence between the distribution given by the rule set and the model estimates is used as the objective function for the optimization. Again, as the model utilizes knowledge to strengthen the predictions of the network, it shifts the dependency of the network from the training data to accurate expert knowledge, which might just be an educated guess in some cases. Similarly, (Xu et al. 2017) incorporated symbolic knowledge into the network by deriving a semantic loss function that acts as a bridge between the network outputs and the logical constraints. The semantic loss function is based on constraints in the form of propositional logic and the probabilities computed by the network. During training, the semantic loss is added to the normal loss of the network and thus acts as a regularization term. This ensures that symbolic knowledge plays a part in updating the parameters of the network.

(Wu et al. 2016) proposed a Knowledge Enhanced Hybrid Neural Network (KEHNN). KEHNN utilizes knowledge in conjunction with the network to cater for text matching in long texts. Here, knowledge is considered to be the global context, such as topics, tags etc., obtained from other algorithms that extract information from multiple sources and datasets. They employed the Twitter LDA model (Zhao et al. 2011) as the prior knowledge, which was considered useful in filtering out noise from long texts. A special gate, known as the knowledge gate, is added to the traditional bi-directional Gated Recurrent Units (GRU) in the model, which controls how much information from the expert knowledge flows into the network.

KINN: The Proposed Framework

Problem Formalization

Time-series forecasting is of vital significance due to its high impact, specifically in domains like supply chain (Fildes, Goodwin, and Onkal 2015), demand prediction (Pacchin et al. 2017), and fault prediction (Baptista et al. 2018). In a typical forecasting setting, a sequence of values {x_{t-1}, x_{t-2}, ..., x_{t-p}} from the past is used to predict the value of the variable at time-step t, where p is the number of past values leveraged for a particular prediction, which we refer to as the window size. Hence, the model is a functional mapping from past observations to the future value. This parametric mapping can be written as:

    x̂_t = φ([x_{t-1}, x_{t-2}, ..., x_{t-p}]; W)

where W = {W_l, b_l}_{l=1}^{L} encapsulates the parameters of the network and φ : R^p → R defines the map from the input space to the output space. The optimal parameters of the network, W*, are computed based on the empirical risk over the training dataset. Using the MSE as the loss function, the optimization problem can be stated as:

    W* = argmin_W (1/|X|) Σ_{x ∈ X} (x_t − φ([x_{t-1}, ..., x_{t-p}]; W))²    (1)

where X denotes the set of training sequences and x ∈ R^{p+1}. Solving this optimization problem, comprising thousands, if not millions, of parameters, requires a large amount of data in order to successfully constrain the parametric space so that a reliable solution is obtained.
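For concreteness, the sketch below shows how the training pairs assumed by Eq. 1 can be built from a univariate series with a sliding window of size p. It is only an illustration of the formulation, not the authors' code; the array `series` of observations is assumed to be available.

    import numpy as np

    def make_windows(series: np.ndarray, p: int):
        """Build (X, y) pairs where each row of X holds the past window
        [x_{t-p}, ..., x_{t-1}] and y holds the corresponding target x_t (Eq. 1)."""
        X, y = [], []
        for t in range(p, len(series)):
            X.append(series[t - p:t])  # p past observations
            y.append(series[t])        # value to be predicted
        return np.asarray(X), np.asarray(y)

    # Example with the window size p = 3 that is later selected by the grid search:
    # X, y = make_windows(series, p=3)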
Humans, on the other hand, leverage their real-world knowledge along with their past experiences in order to make predictions about the future. The aim of KINN is to inject this real-world knowledge, in the form of an expert, into the system. However, as mentioned, information from the expert may not be reliable; therefore, KINN proposes a novel residual learning framework for the incorporation of expert knowledge into the system. The residual framework conditions the prediction of the network on the expert's opinion. As a result, the network acts as a correcting entity for the values generated by the expert. This decouples our system from complete reliance on the expert knowledge.
Figure 1: Traffic flow data grouped into 30 minute windows

Dataset

We evaluated KINN on Caltrans Performance Measurement System (PeMS) data. The data contains records of sensor readings that measure the flow of vehicular traffic on California highways. Since the complete PeMS dataset is enormous in terms of its size, comprising records from multiple highways, we only considered a small fraction of it for our experiments, i.e. the traffic flow on Richards Ave, from January 2016 till March 2016 (http://www.stat.ucdavis.edu/~clarkf/). The dataset contains information regarding the number of vehicles passing on the avenue every 30 seconds. PeMS also contains other details regarding the vehicles; however, we only consider the problem of average traffic flow forecasting in this paper. The data is grouped into 30 minute windows. The goal is to predict the average number of vehicles per 30 seconds for the next 30 minutes. Fig. 1 provides an overview of the grouped dataset. The data clearly exhibits a seasonal component along with high variance at the peaks.
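A minimal sketch of this preprocessing step is given below, assuming the raw PeMS export is a CSV file with a timestamp column and a per-30-second vehicle count column; the file name and column names are hypothetical, since the paper does not describe the raw format.

    import pandas as pd

    # Hypothetical file and column names; only the 30-second counts and their
    # timestamps are needed for the aggregation described above.
    raw = pd.read_csv("pems_richards_ave.csv", parse_dates=["timestamp"])

    # Average number of vehicles per 30 seconds within each 30-minute window,
    # i.e. the forecasting target of this paper (cf. Fig. 1).
    flow = (raw.set_index("timestamp")["vehicle_count"]
               .resample("30min")
               .mean())

    series = flow.to_numpy()  # univariate series of 30-minute averages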
Baseline Expert and Deep Models

LSTMs have achieved state-of-the-art performance in a range of different domains comprising sequential data, such as language translation (Weiss et al. 2017), and handwriting and speech recognition (Zhang et al. 2018; Chiu et al. 2018). Since we are dealing with sequential data, an LSTM was a natural choice as our baseline neural network model. Although the aim of this work is to develop a technique capable of fusing useful information contained in two different modalities, irrespective of their details, we still spent significant compute time to discover the optimal network hyperparameters. This was done through a grid search confined to a reasonable hyperparameter search space, which included the number of layers in the network, the number of neurons in each layer, the activation function for each layer, and the window size p.

The partial auto-correlation of the series was also analyzed to identify the association of the current value in the time-series with its lagged versions, as shown in Fig. 2. As evident from the figure, the series showed a strong correlation with its past three values. This is also cemented by the result of the grid search, which chose a window size of three.

Figure 2: Partial auto-correlation of time-series

The final network consisted of three hidden LSTM layers followed by a dense regression layer. Apart from the first layer, which used sigmoid, the Rectified Linear Unit (ReLU) (Glorot, Bordes, and Bengio 2011) was employed as the activation function. Fig. 3 shows the resulting network architecture. The data was segregated into train, validation and test sets using a 70/10/20 ratio. MSE was employed as the loss function to be optimized. The network was trained for 600 epochs and the parameters producing the best validation score were used for generating predictions on the test set.

Figure 3: Neural network architecture
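The following Keras sketch mirrors the baseline just described (three LSTM layers, sigmoid activation in the first layer, ReLU in the others, and a dense regression head trained with MSE). It is only an approximation: the deep learning framework, the optimizer and the number of units per layer are not reported in the paper and are assumptions here.

    from tensorflow import keras
    from tensorflow.keras import layers

    p = 3  # window size chosen by the grid search

    # 64 units per layer and the Adam optimizer are assumptions; the paper does
    # not report these values.
    baseline = keras.Sequential([
        layers.LSTM(64, activation="sigmoid", return_sequences=True,
                    input_shape=(p, 1)),                            # first layer: sigmoid
        layers.LSTM(64, activation="relu", return_sequences=True),  # remaining layers: ReLU
        layers.LSTM(64, activation="relu"),
        layers.Dense(1),                                             # dense regression layer -> x_hat_t
    ])
    baseline.compile(optimizer="adam", loss="mse")

    # Training as described above: 70/10/20 split, 600 epochs, keeping the
    # weights with the best validation score.
    # baseline.fit(X_train[..., None], y_train,
    #              validation_data=(X_val[..., None], y_val), epochs=600,
    #              callbacks=[keras.callbacks.ModelCheckpoint("best.keras", save_best_only=True)])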
Auto-Regressive Integrated Moving Average (ARIMA) models are widely used by experts in time-series modelling and analysis. Therefore, we employed ARIMA as the expert opinion in our experiments. Since the data demonstrated a significant seasonal component, the seasonal variant of ARIMA (SARIMA) was used, whose parameters were estimated using the Box-Jenkins approach (Box et al. 2015). Fig. 4 demonstrates the predictions obtained by employing the LSTM model as well as the expert (SARIMA) model on the test set.
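A sketch of such a SARIMA expert using statsmodels is shown below. The (p, d, q)(P, D, Q, s) orders are placeholders, since the paper estimates them with the Box-Jenkins procedure but does not report the final values, and the seasonal period s = 48 assumes a daily cycle over 30-minute windows; `train_series` and `test_series` are the training and test portions of the flow series.

    from statsmodels.tsa.statespace.sarimax import SARIMAX

    # Placeholder orders; the values actually used by the authors are not reported.
    expert = SARIMAX(train_series,
                     order=(1, 0, 1),
                     seasonal_order=(1, 1, 1, 48)).fit(disp=False)

    # Expert predictions x_hat_t^p over the test horizon. A plain multi-step
    # forecast is used here; the paper does not specify how the SARIMA
    # predictions for the test set were generated.
    expert_pred = expert.forecast(steps=len(test_series))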
The overall predictions made by both the LSTM as well as the expert network seem plausible, as shown in Fig. 4(a). However, it is only through thorough inspection and investigation on a narrower scale that the strengths and weaknesses of each of the networks are unveiled, as shown in Fig. 4(b). The LSTM tends to capture the overall trend of the data but suffered when predicting small variations in the time-series. SARIMA, on the other hand, was more accurate in predicting variations in the time-series. In terms of MSE, the LSTM model performed considerably worse when compared to the expert model. For this dataset, the discovered LSTM model achieved a MSE of 5.90 compared to 1.24 achieved by SARIMA on the test set.

Figure 4: Predictions of NN and Expert Network. (a) Predictions over the whole test set; (b) predictions over the first 100 steps.

KINN: Knowledge Integrated Neural Network

Most of the work in the literature (Hu et al. 2016; Gollapalli, Li, and Yang 2017) on incorporating expert knowledge into the neural network focuses on training the network by forcing it to mimic the predictions made by the expert network, ergo updating the weights of the network based on the expert's information. However, these works do not cater for a scenario where the expert network does not contain information about all possible scenarios. Moreover, these hybrid knowledge-based network approaches are commonly applied to the classification setting, where the output vector of the network corresponds to a probability distribution. This allows the KL-divergence to be used as the objective function to be minimized in order to match the predictions of the network and the expert network. In the case of time-series forecasting, the output of the network is a scalar value instead of a distribution, which handicaps most of the prior frameworks proposed in the literature.

The KINN framework promotes both the expert model and the network to complement each other rather than directly mimicking the expert's output. This allows KINN to successfully tackle cases where the predictions from the expert are not reliable. Finding the best expert or neural network is not the focus here; instead, the focus is to incorporate the expert prediction, may it be flawed, in such a way that the neural network maintains its strengths while incorporating the strengths of the expert network.

There are many different ways through which knowledge between an expert and the network can be integrated. Let x̂_t^p ∈ R be the prediction made by the expert. We incorporate the knowledge from the expert in a residual scheme inspired by the idea of ResNet introduced by (He et al. 2016). Let φ : R^{p+1} → R define the mapping from the input space to the output space. The learning problem from Eq. 1, after the availability of the expert information, can now be written as:

    x̂_t = φ([x_{t-1}, x_{t-2}, ..., x_{t-p}, x̂_t^p]; W) + x̂_t^p

    W* = argmin_W (1/|X|) Σ_{x ∈ X} (x_t − (φ([x_{t-1}, ..., x_{t-p}, x̂_t^p]; W) + x̂_t^p))²    (2)

Instead of computing a full input-space-to-output-space transform as in Eq. 1, the network instead learns a residual function. This residual function can be considered as a correction term to the prediction made by the expert model. Since the model is learning a correction term for the expert's prediction, it is essential for the model prediction to be conditioned on the expert's prediction, as indicated in Eq. 2. There are two simple ways to achieve this conditioning for the LSTM network. The first one is to append the prediction at the end of the input sequence, as indicated in the equation. Another possibility is to stack a new channel onto the input with repeated values of the expert's prediction. The second case makes the optimization problem easier, as the network has direct access to the expert's prediction at every time-step, and therefore results in minor improvements in terms of MSE. The system architecture for KINN is shown in Fig. 5.

Figure 5: Proposed Architecture

Incorporating expert knowledge in this residual fashion serves a very important purpose in our case. In cases where the expert's predictions are inaccurate, the network can generate large offsets in order to compensate for the error, while it can essentially output zero in cases where the expert's predictions are extremely accurate. With this flexibility built into the system, the system can itself decide its reliance on the expert's predictions.
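As an illustration of Eq. 2, the sketch below wires up the residual scheme in Keras using the first conditioning variant (the expert prediction appended as an extra step of the input sequence). It is a minimal sketch under the same assumptions as the baseline sketch above (framework, layer widths and optimizer are not specified in the paper) and is not the authors' implementation.

    from tensorflow import keras
    from tensorflow.keras import layers

    p = 3  # window size

    # Inputs: the past window [x_{t-p}, ..., x_{t-1}] and the expert prediction x_hat_t^p.
    past = keras.Input(shape=(p, 1), name="past_window")
    expert = keras.Input(shape=(1,), name="expert_prediction")

    # Conditioning variant (a): append the expert prediction as one extra time-step,
    # so the network sees the (p+1)-dimensional input of Eq. 2.
    conditioned = layers.Concatenate(axis=1)([past, layers.Reshape((1, 1))(expert)])

    # The network phi(.; W) only has to learn a residual correction term.
    h = layers.LSTM(64, activation="sigmoid", return_sequences=True)(conditioned)
    h = layers.LSTM(64, activation="relu")(h)
    correction = layers.Dense(1)(h)

    # KINN output: expert prediction plus the learned correction (Eq. 2).
    prediction = layers.Add()([expert, correction])

    kinn = keras.Model(inputs=[past, expert], outputs=prediction)
    kinn.compile(optimizer="adam", loss="mse")

    # kinn.fit([X_train[..., None], expert_train[:, None]], y_train, ...)

The second conditioning variant mentioned above would instead tile the expert prediction along the time axis and concatenate it with the past window as a second input channel, leaving the residual addition unchanged.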
Evaluation

We curated a range of different experiments, each employing KINN in a unique scenario, in order to evaluate its performance under varied conditions. We compare KINN's results with the expert as well as the DNN in terms of performance to highlight the gains achieved by employing the residual learning scheme. To ensure a fair comparison, all of the preprocessing and LSTM hyperparameters were kept the same when the model was tested in isolation and when it was integrated as the residual function in KINN.

In the first setting, we tested and compared KINN's performance in the normal case where the expert predictions are accurate and the LSTM is trained on the complete available training set. We present the results from this normal case in experiment # 01. In order to evaluate KINN's performance in cases where the amount of training data available is small or the expert is inaccurate, we established two different sets of experiments starting from the configuration employed in the first experiment. In the first case, we reduced the amount of training data provided to the models for training. We present the findings from this experiment in experiment # 02. In the second case, we reduced the reliability of the expert predictions by injecting random noise. The results from this experiment are summarized in experiment # 03. A direct extension of the last two experiments is to evaluate KINN's performance in cases where both of these conditions hold, i.e. the amount of training data is reduced and the expert is noisy. We summarize the results for this experiment in experiment # 04. Finally, we evaluated KINN's performance in cases where the expert contains no information. We achieved this in two different ways. We first evaluated the case where the expert always predicts the value of zero. In this case, the target was to evaluate the impact (if any) of introducing the residual learning scheme, since the amount of information presented to the LSTM network was exactly the same as for the isolated LSTM model in the first experiment. We then tested a more realistic scenario, where the expert model replicates the values from the last time-step of the series. We elaborate the findings from this experiment (for both settings) in experiment # 05.

    Experiment   Description                                        % of training data used   DNN     Expert Network   KINN
    1            Full training set and accurate expert              100                       5.90    1.24             0.74
    2            Reduced training set (50%) and accurate expert     50                        6.36    1.52             0.89
    2            Reduced training set (10%) and accurate expert     10                        6.68    2.67             1.53
    3            Full training set and noisy expert                 100                       5.90    7.81             3.09
    4            Reduced training set and noisy expert              10                        6.68    7.81             3.73
    5            Full training set and zero expert predictions      100                       5.90    621.00           5.92
    5            Full training set and delayed expert predictions   100                       5.90    9.04             5.91

Table 1: MSE on the test set for the experiments performed

Experiment # 01: Full training set and accurate expert

We first tested both the LSTM and the expert model in isolation in order to precisely capture the impact of introducing the residual learning scheme. KINN demonstrated significant improvements in training dynamics right from the start. KINN converged faster than the isolated LSTM: as opposed to the isolated LSTM, which required more training time (epochs) to converge, KINN typically converged in only one fourth of the epochs taken by the isolated LSTM, which is a significant improvement in terms of compute time. Apart from the compute time, KINN achieved a MSE of 0.74 on the test set. This is a very significant improvement in comparison to the isolated LSTM model, which had a MSE of 5.90. Even compared to the expert model, KINN demonstrated a relative improvement of 40% in terms of MSE. Fig. 6 showcases the predictions made by KINN along with the isolated LSTM and the expert network on the test set. It is evident from the figure that KINN caters for the weaknesses of each of the two models involved using the information contained in the other. The resulting predictions are more accurate than the expert network at the minima and also capture the small variations in the series which were missed by the LSTM network.

Figure 6: Predictions and the corresponding error plot for the normal case (experiment # 01). (a) Predictions of all models; (b) step-wise error plot.

In order to further evaluate the results, the error at each time-step is compared for the isolated models along with KINN. To aid the visualization, the step-wise error for the first 100 time-steps of the test set is shown in Fig. 6. The plot shows that the step-wise prediction error of KINN is less than that of both the expert model and the LSTM for the major portion of the time.

However, there are instances where the predictions made by KINN are slightly worse than those of the baseline models. In particular, the prediction error of KINN exceeded the error of the expert network for only 30% of the time-steps, and only 22% of the time-steps in the case of the LSTM network. Nevertheless, even in those instances, the performance of KINN was still on par with the other models, since on 99% of the time-steps the difference in error is less than 1.5.

Experiment # 02: Reduced training set and accurate expert

One of the objectives of KINN was to reduce the dependency of the network on large amounts of labelled data. We argue that the proposed model not only utilizes expert knowledge to cater for the shortcomings of the network, but also helps in significantly reducing its dependency on the data. To evaluate this claim, a series of experiments was performed. KINN was trained again from scratch using only 50% of the data in the training set. The test set remained unchanged. Similarly, the LSTM network was also trained with the same 50% subset of the training set.

The LSTM network trained on the 50% subset of the training data attained a MSE of 6.36, which is slightly worse than the MSE of the network trained on the whole training set. Minor degradation was also observed in the performance of the expert network, which achieved a MSE of 1.52. Despite this reduction in the dataset size, KINN achieved significantly better results compared to both the LSTM and the expert model, achieving a MSE of 0.89. Fig. 7 visualizes the corresponding prediction and error plots of the models trained on the 50% subset of the training data.

Figure 7: Prediction and error plot with only 50% of the training data being utilized. (a) Predictions of all models; (b) step-wise error plot.

We performed the same experiment again with a very drastic reduction in the training dataset size by using only a 10% subset of the training data. Fig. 8 visualizes the results from this experiment in the same way, by first plotting the predictions from the models along with the error plot. It is interesting to note that, since the LSTM performed considerably poorly due to the extremely small training set size, the network shifted its focus to the predictions of the expert network and made only minor corrections to them, as evident from Fig. 8(a). This highlights KINN's ability to decide its reliance on the expert predictions based on the quality of the information. In terms of MSE, the LSTM model performed the worst. When trained on only the 10% subset of the training set, the LSTM model attained a MSE of 6.68, whereas the expert model achieved a MSE of 2.67. KINN, on the other hand, still outperformed both of these models and achieved a MSE of 1.53.

Figure 8: Prediction and error plot with only 10% of the training data being utilized. (a) Predictions of all models; (b) step-wise error plot.

Experiment # 03: Full training set and noisy expert

In all of the previous experiments, the expert model was relatively better than the LSTM model employed in our experiments. The obtained results highlight KINN's ability to capitalize on the information obtained from the expert model to achieve significant improvements in its predictions. KINN also demonstrated remarkable generalization despite the drastic reduction in the amount of training data, highlighting its ability to achieve accurate predictions in low-data regimes. However, in conjunction with reducing the dependency of the network on data, it is also imperative that the network does not become too dependent on the expert knowledge, making it essential for the expert to be accurate/perfect. This is usually not catered for in most of the prior work. We believe that the proposed residual scheme enables the network to handle erroneous expert knowledge efficiently by allowing it to be smart enough to realize weaknesses in the expert network and adjust accordingly. In order to verify KINN's ability to adjust to poor predictions from the expert, we performed another experiment where random noise was injected into the predictions from the expert network. This random noise degraded the reliability of the expert predictions. To achieve this, random noise within one standard deviation of the average traffic flow was added to the expert predictions. As a result, the resulting expert predictions attained a MSE of 7.81, which is considerably poor compared to that of the LSTM (5.90). We then trained KINN using these noisy expert predictions. Fig. 9 visualizes the corresponding prediction and error plots.

Figure 9: Prediction and error plot with inaccurate expert prediction. (a) Predictions of all models; (b) step-wise error plot.
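A sketch of this corruption step is given below, assuming `expert_pred` holds the SARIMA predictions and `train_series` the training portion of the flow series. The use of uniform noise is an interpretation; the paper only states that the noise stays within one standard deviation of the average traffic flow.

    import numpy as np

    rng = np.random.default_rng(0)

    # One standard deviation of the traffic flow, used as the noise amplitude.
    sigma = train_series.std()

    # Degrade the expert by adding random noise within one standard deviation.
    noisy_expert_pred = expert_pred + rng.uniform(-sigma, sigma, size=expert_pred.shape)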
As evident from Fig. 9(a), KINN still outperformed both the expert and the LSTM, with a MSE of 3.09. Despite the fact that neither the LSTM nor the expert model was accurate, KINN still managed to squeeze out the useful information from both modalities to construct an accurate predictor. This demonstrates the true strength of KINN, as it not only reduces the dependency of the network on the data but also adapts itself in the case of poorly made expert opinions. KINN achieved a significant reduction of 48% in the MSE of the LSTM network by incorporating the noisy expert prediction in the residual learning framework.

Experiment # 04: Reduced training set and noisy expert

As a natural follow-up to the last two experiments, we introduced both conditions at the same time, i.e. a reduced training set size and noisy predictions from the expert. The training set was again reduced to a 10% subset of the training data for training the model while keeping the test set intact. Fig. 10 demonstrates that despite this worst condition, KINN still managed to outperform both the LSTM and the noisy expert predictions.

Figure 10: Prediction and error plot with inaccurate expert prediction and with only 10% data. (a) Predictions of all models; (b) step-wise error plot.

Experiment # 05: Full training set and poor expert

As the final experiment, we evaluated KINN's performance in cases where the expert predictions are not useful at all. We achieved this via two different settings. In the first setting, we considered that the expert network predicts zero every time. In the second setting, the expert network was made to lag by a step of one, resulting in a mismatch of the time step with the predictions. Putting zero in place of x̂_t^p in Eq. 2 yields:

    x̂_t = φ([x_{t-1}, x_{t-2}, ..., x_{t-p}, 0]; W) + 0

    W* = argmin_W (1/|X|) Σ_{x ∈ X} (x_t − (φ([x_{t-1}, ..., x_{t-p}, 0]; W) + 0))²
       = argmin_W (1/|X|) Σ_{x ∈ X} (x_t − φ([x_{t-1}, ..., x_{t-p}, 0]; W))²

This is almost equivalent to the normal unconditioned full input-to-output space projection learning case (Eq. 1), except for a zero in the conditioning vector. However, in the case of lagged predictions by the expert network, since we stack the expert prediction x̂_t^p in a separate channel, the network assigns a negligible weight to this channel, resulting in exactly the same performance as the normal case.

Table 1 provides the details regarding the results obtained for this experiment. It is clear from the table that in cases where the expert network either gave zero as its prediction or gave lagged predictions, which are useless, the network performance was identical to the normal case, since the network learned to ignore the output from the expert. These results highlight that KINN provides a lower bound on the performance based on the performance of the two involved entities: the expert model and the network.

Discussion

These thorough experiments advocate that the underlying residual mapping function learned by KINN is successful in combining the network with the prediction made by the expert. Specifically, KINN demonstrated the ability to recognize the quality of the prediction made by both of the base models and shifted its reliance accordingly. In all of the experiments that we have conducted, the MSE of the predictions made by KINN never exceeded (disregarding insignificant changes) the MSE of the predictions achieved by the best among the LSTM and the expert model, except in the case of completely useless expert predictions, where it performed on par with the LSTM network. Table 1 provides a summary of the results obtained from all the different experiments performed. It is interesting to note that even with a huge reduction in the size of the training set, the MSE does not increase as drastically as one would expect. This is due to the strong seasonal component present in the dataset. As a result, even with only a 10% subset of the training data, the models were able to learn the general pattern exhibited by the sequence. It is only in estimating small variations that these networks faced difficulty when trained on less data.

Conclusion

We propose a new architecture for incorporating expert knowledge into a deep network. It incorporates this expert knowledge in a residual scheme where the network learns a correction term for the predictions made by the expert. The knowledge incorporation scheme introduced by KINN has three key advantages. The first is the relaxation of the requirement for a huge dataset to train the model. The second is the provision of a lower bound on the performance of the resulting predictor, since KINN achieves the best of both worlds by combining the two different modalities. The third is its robustness in catering for poor/noisy predictions made by the expert. Through extensive evaluation, we demonstrated that the underlying residual function learned by the network makes the system robust enough to deal with imprecise expert information, even in cases where there is a dearth of labelled data. This is because the network does not try to imitate the predictions made by the expert network, but instead extracts and combines the useful information contained in both of the domains.

Acknowledgements

This work is partially supported by the Higher Education Commission (Pakistan), Continental Automotive GmbH and the BMBF project DeFuseNN (Grant 01IW17002).
References

Baptista, M.; Sankararaman, S.; de Medeiros, I. P.; Nascimento Jr, C.; Prendinger, H.; and Henriques, E. M. 2018. Forecasting fault events for predictive maintenance using data-driven techniques and ARMA modeling. Computers & Industrial Engineering 115:41–53.

Box, G. E.; Jenkins, G. M.; Reinsel, G. C.; and Ljung, G. M. 2015. Time Series Analysis: Forecasting and Control. John Wiley & Sons.

Chiu, C.-C.; Sainath, T. N.; Wu, Y.; Prabhavalkar, R.; Nguyen, P.; Chen, Z.; Kannan, A.; Weiss, R. J.; Rao, K.; Gonina, E.; et al. 2018. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4774–4778. IEEE.

Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Fildes, R.; Goodwin, P.; and Onkal, D. 2015. Information use in supply chain forecasting.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 315–323.

Gollapalli, S. D.; Li, X.-L.; and Yang, P. 2017. Incorporating expert knowledge into keyphrase extraction. In AAAI, 3180–3187.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.

Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.-r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.

Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Hu, Z.; Ma, X.; Liu, Z.; Hovy, E.; and Xing, E. 2016. Harnessing deep neural networks with logic rules. arXiv preprint arXiv:1603.06318.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

Lake, B. M.; Salakhutdinov, R.; and Tenenbaum, J. B. 2015. Human-level concept learning through probabilistic program induction. Science 350(6266):1332–1338.

Pacchin, E.; Gagliardi, F.; Alvisi, S.; Franchini, M.; et al. 2017. A comparison of short-term water demand forecasting models. In CCWI2017, 24–24. The University of Sheffield.

Rohrbach, M.; Stark, M.; and Schiele, B. 2011. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, 1641–1648. IEEE.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484.

Sun, C.; Shrivastava, A.; Singh, S.; and Gupta, A. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In Computer Vision (ICCV), 2017 IEEE International Conference on, 843–852. IEEE.

Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.

Towell, G. G., and Shavlik, J. W. 1994. Knowledge-based artificial neural networks. Artificial Intelligence 70(1-2):119–165.

Tran, S. N., and Garcez, A. S. d. 2018. Deep logic networks: Inserting and extracting knowledge from deep belief networks. IEEE Transactions on Neural Networks and Learning Systems 29(2):246–258.

Wang, F.-Y.; Zhang, J. J.; Zheng, X.; Wang, X.; Yuan, Y.; Dai, X.; Zhang, J.; and Yang, L. 2016. Where does AlphaGo go: From Church-Turing thesis to AlphaGo thesis and beyond. IEEE/CAA Journal of Automatica Sinica 3(2):113–120.

Weiss, R. J.; Chorowski, J.; Jaitly, N.; Wu, Y.; and Chen, Z. 2017. Sequence-to-sequence models can directly translate foreign speech. arXiv preprint arXiv:1703.08581.

Wu, Y.; Wu, W.; Li, Z.; and Zhou, M. 2016. Knowledge enhanced hybrid neural network for text matching. arXiv preprint arXiv:1611.04684.

Xu, J.; Zhang, Z.; Friedman, T.; Liang, Y.; and Broeck, G. V. d. 2017. A semantic loss function for deep learning with symbolic knowledge. arXiv preprint arXiv:1711.11157.

Zhang, X.-Y.; Yin, F.; Zhang, Y.-M.; Liu, C.-L.; and Bengio, Y. 2018. Drawing and recognizing Chinese characters with recurrent neural network. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(4):849–862.

Zhao, W. X.; Jiang, J.; Weng, J.; He, J.; Lim, E.-P.; Yan, H.; and Li, X. 2011. Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval, 338–349. Springer.