                     On the Readability of Deep Learning Models:
                      the role of Kernel-based Deep Architectures
                     Danilo Croce and Daniele Rossini and Roberto Basili
Department of Enterprise Engineering
University of Roma Tor Vergata
                          {croce,basili}@info.uniroma2.it

Abstract

English. Deep Neural Networks achieve state-of-the-art performances in several semantic NLP tasks, but lack explanation capabilities, as the interpretability of the underlying acquired models is limited. In other words, tracing back causal connections between the linguistic properties of an input instance and the produced classification is not possible. In this paper, we propose to apply Layerwise Relevance Propagation over linguistically motivated neural architectures, namely Kernel-based Deep Architectures (KDA), to guide argumentations and explanation inferences. In this way, decisions provided by a KDA can be linked to the semantics of input examples, used to linguistically motivate the network output.

Italiano. Deep Neural Networks today achieve the state of the art in many NLP tasks, but the limited interpretability of the models resulting from training restricts the understanding of their inferences. It is thus not possible to determine causal connections between the linguistic properties of an example and the classification produced by the network. In this work, Layerwise Relevance Propagation is applied to Kernel-based Deep Architectures (KDA) to determine connections between the semantics of the input and the output class that correspond to linguistically transparent explanations of the decision.
1   Introduction
                                                       MLP network. We will show how KDA input
Deep Neural Networks are usually criticized as they are not epistemologically transparent devices, i.e., their models cannot be used to provide explanations of the resulting inferences. An example is neural question classification (QC) (e.g., (Croce et al., 2017)). In QC, the correct category of a question is detected to optimize the later stages of a question answering system (Li and Roth, 2006). An epistemologically transparent learning system should trace back the causal connections between the proposed question category and the linguistic properties of the input question. For example, the system could motivate the decision that "What is the capital of Zimbabwe?" refers to a Location with a sentence such as: "Since it is similar to 'What is the capital of California?', which also refers to a Location". Unfortunately, in neural models such as Multilayer Perceptrons (MLP), Long Short-Term Memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), or even attention-based networks (Larochelle and Hinton, 2010), the learned parameters have no clear conceptual counterpart: it is thus difficult to trace back the network components (e.g., neurons or layers in the resulting topology) responsible for the answer.

In image classification, Layerwise Relevance Propagation (LRP) (Bach et al., 2015) has been used to decompose backwards, across the MLP layers, the evidence about the contribution of individual input fragments (i.e., pixels of the input images) to the final decision. Evaluation against the MNIST and ILSVRC benchmarks suggests that LRP activates associations between input and output fragments, thus tracing back meaningful causal connections.

In this paper, we propose the use of a similar mechanism over a linguistically motivated network architecture, the Kernel-based Deep Architecture (KDA) (Croce et al., 2017). Tree kernels (Collins and Duffy, 2001) are here used to integrate syntactic/semantic information within an MLP network. We will show how KDA input nodes correspond to linguistic instances and that, by applying the LRP method, we are able to trace back causal associations between the semantic classification and such instances. The evaluation of the LRP algorithm is based on the idea that explanations improve the user's expectations about the correctness of an answer, and it shows the applicability of the approach in human-computer interfaces.

In the rest of the paper, Section 2 describes the KDA neural approach, while Section 3 illustrates how LRP connects to KDAs. Section 4 defines the explanatory models and the evaluation methodology, and Section 5 reports early results of the evaluation.
2   Training Neural Networks in Kernel Spaces

Given a training set $D$, a kernel $K(o_i, o_j)$ is a similarity function over $D^2$ that corresponds to a dot product in the implicit kernel space, i.e., $K(o_i, o_j) = \Phi(o_i) \cdot \Phi(o_j)$. Kernel functions are used by learning algorithms, such as Support Vector Machines (Shawe-Taylor and Cristianini, 2004), to efficiently operate on instances in the kernel space: their advantage is that the projection function $\Phi(o) = \vec{x} \in \mathbb{R}^n$ is never explicitly computed. The Nyström method is a factorization method applied to derive a new low-dimensional embedding $\tilde{x}$ in an $l$-dimensional space, with $l \ll n$, so that $G \approx \tilde{G} = \tilde{X}\tilde{X}^\top$, where $G = XX^\top$ is the Gram matrix such that $G_{ij} = \Phi(o_i) \cdot \Phi(o_j) = K(o_i, o_j)$. The approximation $\tilde{G}$ is obtained using a subset of $l$ columns of the matrix, i.e., a selection of a subset $L \subset D$ of the available examples, called landmarks. Given $l$ randomly sampled columns of $G$, let $C \in \mathbb{R}^{|D| \times l}$ be the matrix of these sampled columns. Then, we can rearrange the columns and rows of $G$ and define $X = [X_1\; X_2]$ such that:

$$G = \begin{bmatrix} W & X_1^\top X_2 \\ X_2^\top X_1 & X_2^\top X_2 \end{bmatrix} \qquad C = \begin{bmatrix} W \\ X_2^\top X_1 \end{bmatrix}$$

where $W = X_1^\top X_1$, i.e., the subset of $G$ that contains only landmarks. The Nyström approximation can be defined as:

$$G \approx \tilde{G} = C\, W^\dagger C^\top \quad (1)$$

where $W^\dagger$ denotes the Moore-Penrose inverse of $W$. If we apply the Singular Value Decomposition (SVD) to $W$, which is symmetric positive definite, we get $W = USV^\top = USU^\top$. It is then straightforward to see that $W^\dagger = US^{-1}U^\top = US^{-\frac{1}{2}}S^{-\frac{1}{2}}U^\top$ and, by substitution, that $G \approx \tilde{G} = (CUS^{-\frac{1}{2}})(CUS^{-\frac{1}{2}})^\top = \tilde{X}\tilde{X}^\top$. Given an example $o \in D$, its new low-dimensional representation $\tilde{x}$ is determined by considering the corresponding row $\vec{c}$ of $C$ as:

$$\tilde{x} = \vec{c}\; U S^{-\frac{1}{2}} \quad (2)$$

where $\vec{c}$ is the vector whose dimensions contain the evaluations of the kernel function between $o$ and each landmark $o_j \in L$. The method therefore produces $l$-dimensional vectors.
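The projection in Eq. 2 is straightforward to implement. The following sketch (ours, for illustration only: a NumPy version assuming a generic `kernel(a, b)` similarity function; all names are hypothetical, not from the paper) derives the landmarks and the projection matrix $US^{-\frac{1}{2}}$, and embeds a new example:

```python
import numpy as np

def nystrom_fit(kernel, dataset, l, seed=0):
    """Sample l landmarks and compute the Nystrom projection U S^{-1/2} (Eq. 2)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(dataset), size=l, replace=False)
    landmarks = [dataset[i] for i in idx]
    # W = X1^T X1: kernel evaluations among the landmarks (symmetric PSD)
    W = np.array([[kernel(a, b) for b in landmarks] for a in landmarks])
    U, s, _ = np.linalg.svd(W)
    s = np.maximum(s, 1e-12)   # guard against vanishing singular values
    H = U / np.sqrt(s)         # H_Ny = U S^{-1/2}, column-wise scaling
    return landmarks, H

def nystrom_embed(kernel, landmarks, H, o):
    """x~ = c H_Ny, where c holds the kernel evaluations K(o, o_j) vs. landmarks."""
    c = np.array([kernel(o, lm) for lm in landmarks])
    return c @ H
```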
                                                        propagation into a KDA architecture corresponds
Given a labeled dataset, a Multi-Layer Perceptron (MLP) architecture can be defined, with a specific Nyström layer based on the Nyström embeddings of Eq. 2 (Croce et al., 2017). Such a Kernel-based Deep Architecture (KDA) has an input layer, a Nyström layer, a possibly empty sequence of non-linear hidden layers, and a final classification layer, which produces the output. In particular, the input layer corresponds to the input vector $\vec{c}$, i.e., the row of the $C$ matrix associated with an example $o$. It is then mapped to the Nyström layer through the projection in Equation 2. Notice that the embedding also provides the proper weights, defined by $US^{-\frac{1}{2}}$, so that the mapping can be expressed through the Nyström matrix $H_{Ny} = US^{-\frac{1}{2}}$: it corresponds to a pre-training stage based on the SVD. Formally, the low-dimensional embedding of an input example $o$, $\tilde{x} = \vec{c}\, H_{Ny} = \vec{c}\, US^{-\frac{1}{2}}$, encodes the kernel space. Any neural network can then be adopted: in the rest of this paper, we assume that a traditional Multi-Layer Perceptron (MLP) architecture is stacked in order to solve the targeted classification problems. The final layer of the KDA is the classification layer, whose dimensionality depends on the classification task: it computes a linear classification function with a softmax operator.

A KDA is thus stimulated by an input vector $\vec{c}$ whose entries are the kernel evaluations $K(o, l_i)$ between the example $o$ and the landmarks $l_i$. Linguistic kernels (such as Semantic Tree Kernels (Croce et al., 2011)) depend on the syntactic/semantic similarity between $o$ and the subset of landmarks $l_i$ used for the space reconstruction. We will see hereafter how tracing relevance propagation back into a KDA architecture corresponds to determining which semantic landmarks contribute most to the final output decision.
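As a minimal sketch of the resulting network (again ours, not the authors' released code; the layer shapes and the tanh non-linearity are illustrative assumptions), the forward pass can be written as:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kda_forward(c, H, hidden, output):
    """KDA forward pass over one example.

    c: kernel evaluations vs. the landmarks (input layer),
    H: Nystrom matrix U S^{-1/2} (fixed weights, pre-trained via SVD),
    hidden: list of (W, b) pairs, possibly empty, for the hidden layers,
    output: (W, b) pair of the final linear classification layer.
    """
    x = c @ H                   # Nystrom layer: the projection of Eq. 2
    for W, b in hidden:
        x = np.tanh(x @ W + b)  # non-linear hidden layers
    W, b = output
    return softmax(x @ W + b)   # linear classifier with softmax
```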
3   Layer-wise Relevance Propagation in Kernel-based Deep Architectures

Layer-wise Relevance Propagation (LRP, presented in (Bach et al., 2015)) is a framework which allows decomposing the prediction of a deep neural network computed over a sample, e.g. an image, down to relevance scores for the single input dimensions, such as a subset of pixels.

Formally, let $f : \mathbb{R}^d \to \mathbb{R}^+$ be a positive real-valued function taking a vector $\vec{x} \in \mathbb{R}^d$ as input: $f$ quantifies, for example, the probability of $\vec{x}$ characterizing a certain class. Layer-wise Relevance Propagation assigns to each dimension, or feature, $x_d$, a relevance score $R_d^{(1)}$ such that:

$$f(x) \approx \textstyle\sum_d R_d^{(1)} \quad (3)$$

Features whose score $R_d^{(1)} > 0$ (or $R_d^{(1)} < 0$) correspond to evidence in favor of (or against) the output classification. In other words, LRP allows identifying fragments of the input that play key roles in the decision, by propagating relevance backwards. Let us suppose to know the relevance score $R_j^{(l+1)}$ of a neuron $j$ at network layer $l+1$; it can then be decomposed into messages $R_{i \leftarrow j}^{(l,l+1)}$ sent to neurons $i$ in layer $l$:

$$R_j^{(l+1)} = \textstyle\sum_{i \in (l)} R_{i \leftarrow j}^{(l,l+1)} \quad (4)$$

Hence the relevance of a neuron $i$ at layer $l$ can be defined as:

$$R_i^{(l)} = \textstyle\sum_{j \in (l+1)} R_{i \leftarrow j}^{(l,l+1)} \quad (5)$$

Note that Eq. 4 and Eq. 5 are such that Eq. 3 holds. In this work, we adopted the $\epsilon$-rule defined in (Bach et al., 2015) to compute the messages $R_{i \leftarrow j}^{(l,l+1)}$, i.e.

$$R_{i \leftarrow j}^{(l,l+1)} = \frac{z_{ij}}{z_j + \epsilon \cdot \mathrm{sign}(z_j)}\, R_j^{(l+1)}$$

where $z_{ij} = x_i w_{ij}$, $z_j = \sum_i z_{ij} + b_j$ (with $b_j$ the bias of neuron $j$), and $\epsilon > 0$ is a numerical stabilizing term which must be small. Notice that the weights $w_{ij}$ correspond to weighted activations of input neurons. If we apply LRP to a KDA, it implicitly traces the relevance back to the input layer, i.e. to the landmarks. It thus tracks back syntactic, semantic and lexical relations between a question and the landmarks, and grants high relevance to the relations the network selected as highly discriminating for the class representations it learned. Note that this is different from similarity in terms of kernel-function evaluation, as the latter is task independent, whereas LRP scores are not. Notice also that each landmark is uniquely associated with an entry of the input vector $\vec{c}$, as shown in Sec. 2, and, as a member of the training dataset, it also corresponds to a known class.
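For a single dense layer, the $\epsilon$-rule admits a compact vectorized form. The sketch below (our illustration, under the assumption that the activations `x`, weights `W` and biases `b` of each dense layer are available) redistributes the relevance of layer $l+1$ onto layer $l$; applying it layer by layer down to the input of a KDA yields one relevance score per entry of $\vec{c}$, i.e., per landmark:

```python
import numpy as np

def lrp_epsilon(x, W, b, R_next, eps=1e-8):
    """Epsilon-rule: redistribute relevance R_next (layer l+1) onto layer l.

    x: activations of layer l (shape i), W: weights (shape i x j),
    b: biases of layer l+1 (shape j), R_next: relevances R_j^{(l+1)}.
    """
    z_ij = x[:, None] * W              # z_ij = x_i * w_ij
    z_j = z_ij.sum(axis=0) + b         # z_j = sum_i z_ij + b_j
    denom = z_j + eps * np.sign(z_j)   # stabilized denominator (z_j = 0 aside)
    messages = z_ij * (R_next / denom) # messages R_{i<-j}, the terms of Eq. 4
    return messages.sum(axis=1)        # R_i^{(l)} as in Eq. 5
```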
                                                                  The explanatory model is then a function
note that this is different from similarity in terms
                                                               M(e, Lk ) which maps an explanation e, a sub set
of kernel-function evaluation as the latter is task
                                                               Lk of the active and consistent landmarks L for e
independent whereas LRP scores are not. Notice
                                                               into a sentence in natural language. Of course sev-
also that each landmark is uniquely associated to
                                                               eral definitions for M(e, Lk ) and Lk are possible.
an entry of the input vector ~c, as shown in Sec 2,
and, as a member of the training dataset, it also                  1
                                                                     k is a parameter used to make explanation depending on
corresponds to a known class.                                  not more than k landmarks, denoted by Lk .
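Under these definitions, selecting the activated landmarks and testing their consistency reduces to a few lines. The sketch below is ours, with hypothetical names; `scores` holds the LRP activations $r_\ell^s$ for one sentence:

```python
def activations(scores, k):
    """a(l, s): +1 for the k strongest positive scores, -1 for the k most negative."""
    pos = sorted((i for i, r in enumerate(scores) if r > 0),
                 key=lambda i: scores[i], reverse=True)[:k]
    neg = sorted((i for i, r in enumerate(scores) if r < 0),
                 key=lambda i: scores[i])[:k]
    return {**{i: +1 for i in pos}, **{i: -1 for i in neg}}

def consistent(C_l, C, a, tau):
    """True iff delta(C_l, C) * a(l, s) * tau > 0."""
    delta = 1 if C_l == C else -1   # 2 * kron(C' = C) - 1
    return delta * a * tau > 0
```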
A general explanatory model would be:

$$\mathcal{M}(e, L_k) = \begin{cases} \text{"$s$ is $C$ since it is similar to $\ell$"}, \;\forall \ell \in L_k^+ & \text{if } \tau > 0 \\ \text{"$s$ is not $C$ since it is different from $\ell$ which is $C$"}, \;\forall \ell \in L_k^- & \text{if } \tau < 0 \\ \text{"$s$ is $C$ but I don't know why"} & \text{if } L_k = \emptyset \end{cases}$$

where $L_k^+, L_k^- \subseteq L_k$ are the partitions of landmarks with positive and negative relevance scores in $L_k$, respectively. Here we provide examples for the two explanatory models used during the experimental evaluation. A first possible model returns the analogy only with the (unique) consistent landmark with the highest positive score when $\tau = 1$, and with the lowest negative one when $\tau = -1$. The explanation of a rejected decision in the Argument Classification step of a Semantic Role Labeling task (Vanzo et al., 2016), described by the triple $e_1 = \langle$"vai in camera da letto", SOURCE_BRINGING, $-1\rangle$, is:

    I think "in camera da letto" IS NOT [SOURCE] of [BRINGING] in "Vai in camera da letto" (LU:[vai]) since it's different from "sul tavolino" which is [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[porta])

The second model uses two active landmarks: one consistent and one contradictory with respect to the decision. For the triple $e_1 = \langle$"vai in camera da letto", GOAL_MOTION, $1\rangle$, the second model produces:

    I think "in camera da letto" IS [GOAL] of [MOTION] in "Vai in camera da letto" (LU:[vai]) since it recalls "al telefono" which is [GOAL] of [MOTION] in "Vai al telefono e controlla se ci sono messaggi" (LU:[vai]), and it IS NOT [SOURCE] of [BRINGING] since it is different from "sul tavolino" which is the [SOURCE] of [BRINGING] in "Portami il mio catalogo sul tavolino" (LU:[portami])
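The general model above is essentially a template instantiation. A sketch of such a verbalizer (ours, with illustrative naming; `L_pos` and `L_neg` are the consistent landmarks selected as described above) is:

```python
def verbalize(s, C, tau, L_pos, L_neg):
    """M(e, L_k): compile the explanation for e = <s, C, tau> from active landmarks."""
    if tau > 0 and L_pos:
        return [f'"{s}" is {C} since it is similar to "{l}"' for l in L_pos]
    if tau < 0 and L_neg:
        return [f'"{s}" is not {C} since it is different from "{l}" which is {C}'
                for l in L_neg]
    return [f'"{s}" is {C} but I don\'t know why']   # L_k is empty
```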
4.1   Evaluation methodology

In order to evaluate the impact of the produced explanations, we defined the following task: given a classification decision, i.e. the input $o$ is classified as $C$, measure the impact of the explanation $e$ on the belief that a user has in the statement "$o \in C$ is true". This information can be modeled through the estimates of the following probabilities: $P(o \in C)$, which characterizes the amount of confidence the user has in accepting the statement, and its corresponding form $P(o \in C\,|\,e)$, i.e. the same quantity in the case the user is provided with the explanation $e$. The core idea is that semantically coherent and exhaustive explanations must indicate correct classifications, whereas incoherent or non-existent explanations must hint towards wrong classifications. A quantitative measure of such an increase (or decrease) in confidence is the Information Gain (IG, (Kononenko and Bratko, 1991)) of the decision $o \in C$. Notice that IG measures the increase of probability corresponding to correct decisions, and the reduction of probability in case the decision is wrong. This amount suitably addresses the shift in uncertainty $-\log_2(P(\cdot))$ between two (subjective) estimates, i.e., $P(o \in C)$ vs. $P(o \in C\,|\,e)$.

Different explanatory models $\mathcal{M}$ can also be compared. The relative Information Gain $I_\mathcal{M}$ is measured against a collection of explanations $e \in T_\mathcal{M}$ generated by $\mathcal{M}$ and then normalized by the collection's entropy $E$ as follows:

$$I_\mathcal{M} = \frac{1}{E} \frac{1}{|T_\mathcal{M}|} \sum_{e \in T_\mathcal{M}} I(e)$$

where $I(e)$ is the IG of each explanation².

² More details are in (Kononenko and Bratko, 1991).
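Under our reading of (Kononenko and Bratko, 1991), each explanation contributes the shift in surprisal towards the true outcome, so the relative gain $I_\mathcal{M}$ can be sketched as follows (an illustrative implementation; `ratings` pairs each posterior $P(o \in C\,|\,e)$ with whether the decision was correct):

```python
import math

def info_gain(prior, posterior, correct):
    """I(e): gain (in bits) if the decision is correct, loss otherwise."""
    if correct:
        return math.log2(posterior) - math.log2(prior)
    return math.log2(1 - posterior) - math.log2(1 - prior)

def relative_gain(ratings, prior=0.5):
    """I_M: mean I(e) over the collection T_M, normalized by its entropy E."""
    E = -(prior * math.log2(prior) + (1 - prior) * math.log2(1 - prior))
    return sum(info_gain(prior, p, ok) for p, ok in ratings) / (E * len(ratings))
```

For instance, in the balanced setting used here (prior 0.5, hence $E = 1$), a Very Good rating ($P(o \in C\,|\,e) = 0.95$) of a correct decision contributes about $+0.93$ bits, while an Incoherent rating ($0.05$) contributes about $-3.32$ bits.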
5   Experimental Evaluation

The effectiveness of the proposed approach has been measured against two different semantic processing tasks, i.e. Question Classification (QC) over the UIUC dataset (Li and Roth, 2006) and Argument Classification in Semantic Role Labeling (SRL-AC) over the HuRIC dataset (Bastianelli et al., 2014; Vanzo et al., 2016). The adopted architecture consisted in an LRP-integrated KDA with 1 hidden layer and 500 landmarks for QC, 2 hidden layers and 100 landmarks for SRL-AC, and a stabilization term $\epsilon = 10^{-8}$.

We defined five quality categories and associated each with a value of $P(o \in C\,|\,e)$, as shown in Table 1. Three annotators then independently rated explanations generated from a collection composed of an equal number of correct and wrong classifications (for a total amount of 300 and 64 explanations, respectively, for QC and SRL-AC). This perfect balancing makes the prior probability $P(o \in C)$ equal to 0.5, i.e. maximal entropy with a baseline $IG = 0$ in the $[-1, 1]$ range. Notice that annotators had no information on the system classification performance, but just knowledge of the explanation dataset entropy.

  Category    P(o ∈ C|e)    1 − P(o ∈ C|e)
  V.Good         0.95            0.05
  Good           0.80            0.20
  Weak           0.50            0.50
  Bad            0.20            0.80
  Incoher.       0.05            0.95

Table 1: Posterior probabilities w.r.t. quality categories.

  Model             QC       SRL-AC
  One landmark     0.548      0.669
  Two landmarks    0.580      0.784

Table 2: Information gains for the two Explanatory Models applied to the QC and SRL-AC datasets.
5.1   Question Classification

Experimental evaluations³ showed that both models were able to gain more than half of the bit required to ascertain whether the network statement is true or not (Table 2). Consider:

    I think "What year did Oklahoma become a state?" refers to a NUMBER since it recalls me "The film Jaws was made in what year?"

Here the model returned coherent supporting evidence, a somewhat easy case given the available discriminative pair, i.e. "What year". The system is able to capture semantic similarities even in poorer conditions, e.g.:

    I think "Where is the Mall of the America?" refers to a LOCATION since it recalls me "What town was the setting for The Music Man?" which refers to a LOCATION.

This high-quality explanation is achieved even with poor lexical overlap. It seems that richer representations are here involved, with grammatical and semantic similarity acting as the main information involved in the decision at hand. Let us consider:

    I think "Mexican pesos are worth what in U.S. dollars?" refers to a DESCRIPTION since it recalls me "What is the Bernoulli Principle?"

Here the provided explanation is incoherent, as expected, since the classification is wrong. Now consider:

    I think "What is the sales tax in Minnesota?" refers to a NUMBER since it recalls me "What is the population of Mozambique?" and does not refer to an ENTITY since it is different from "What is a fear of slime?".

Although the explanation seems fairly coherent, it is actually misleading, as ENTITY is the annotated class. This shows how the system may lack contextual information, as humans do, against inherently ambiguous questions.

³ For details on the KDA performance on these tasks, see (Croce et al., 2017).

5.2   Argument Classification

The evaluation also targeted a second task, namely Argument Classification in Semantic Role Labeling (SRL-AC): the KDA is here fed with vectors from tree kernel evaluations, as discussed in (Croce et al., 2011). The evaluation is carried out over the HuRIC dataset (Vanzo et al., 2016), which includes about 240 domotic commands in Italian, comprising about 450 roles. The system has an accuracy of 91.2% on a test set of about 90 examples, while the training and development sets have a size of, respectively, 270 and 90 examples. We considered 64 explanations for measuring the IG of the two explanation models. Table 2 confirms that both explanatory models performed even better than in QC. This is due to the narrower linguistic domain (14 frames are involved) and the clearer boundaries between classes: annotators seem more sensitive to the explanatory information when assessing the network decision. An example of a generated sentence is:

    I think "con me" is NOT the MANNER of COTHEME in "Robot vieni con me nel soggiorno?" (LU:[vieni]) since it does NOT recall me "lentamente" which is MANNER in "Per favore segui quella persona lentamente" (LU:[segui]). It is rather COTHEME of COTHEME since it recalls me "mi" which is COTHEME in "Seguimi nel bagno" (LU:[segui]).

6   Conclusion and Future Works

This paper describes the application of LRP to a KDA that makes use of analogies as explanations of a neural network decision. A methodology to measure the explanation quality has also been proposed, and the experimental evidence confirms the effectiveness of the method in increasing the trust of a user in automatic classifications. Future work will focus on the selection of subtrees as meaningful evidence for the explanation, on the modeling of negative information for disambiguation, as well as on a more in-depth investigation of landmark selection policies. Moreover, improved experimental scenarios involving users and dialogues will also be designed, e.g. involving further investigation within Semantic Role Labeling, using the method proposed in (Croce et al., 2012).
References

Sebastian Bach, Alexander Binder, Gregoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. 2015. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLOS ONE, 10(7).

Emanuele Bastianelli, Giuseppe Castellucci, Danilo Croce, Luca Iocchi, Roberto Basili, and Daniele Nardi. 2014. HuRIC: a human robot interaction corpus. In LREC, pages 4519–4526. European Language Resources Association (ELRA).

Michael Collins and Nigel Duffy. 2001. New rank-
  ing algorithms for parsing and tagging: Kernels over
  discrete structures, and the voted perceptron. In Pro-
  ceedings of the 40th Annual Meeting on Association
  for Computational Linguistics (ACL ’02), July 7-12,
  2002, Philadelphia, PA, USA, pages 263–270. Asso-
  ciation for Computational Linguistics, Morristown,
  NJ, USA.

Danilo Croce, Alessandro Moschitti, and Roberto
  Basili. 2011. Structured lexical similarity via con-
  volution kernels on dependency trees. In Proceed-
  ings of the 2011 Conference on Empirical Methods
  in Natural Language Processing, pages 1034–1046.
  Association for Computational Linguistics.

Danilo Croce, Alessandro Moschitti, Roberto Basili,
  and Martha Palmer. 2012. Verb classification us-
  ing distributional similarity in syntactic and seman-
  tic structures. In ACL (1), pages 263–272. The As-
  sociation for Computer Linguistics.

Danilo Croce, Simone Filice, Giuseppe Castellucci,
  and Roberto Basili. 2017. Deep learning in seman-
  tic kernel spaces. In Proceedings of the 55th Annual
  Meeting of the Association for Computational Lin-
  guistics (Volume 1: Long Papers), pages 345–354,
  Vancouver, Canada, July. Association for Computa-
  tional Linguistics.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long
  short-term memory. Neural Comput., 9(8):1735–
  1780, November.

Igor Kononenko and Ivan Bratko. 1991. Information-
   based evaluation criterion for classifier’s perfor-
   mance. Machine Learning, 6(1):67–80, Jan.

Hugo Larochelle and Geoffrey E. Hinton. 2010.
  Learning to combine foveal glimpses with a third-
  order Boltzmann machine. In Proceedings of Neu-
  ral Information Processing Systems (NIPS), pages
  1243–1251.

Xin Li and Dan Roth. 2006. Learning question clas-
  sifiers: the role of semantic information. Natural
  Language Engineering, 12(3):229–249.

John Shawe-Taylor and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.

Andrea Vanzo, Danilo Croce, Roberto Basili, and Daniele Nardi. 2016. Context-aware spoken language understanding for human robot interaction. In Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.