<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Solving Bar Exam Questions with Deep Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adebayo Kolawole John</string-name>
          <email>collawolley3@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi Di Caro</string-name>
          <email>dicaro@di.unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Boella</string-name>
          <email>guido@di.unito.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <institution>University of Torino</institution>
          ,
          <addr-line>Corso Svizzera 185, Torino, 10149</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>16</volume>
      <issue>2017</issue>
      <abstract>
        <p>In this paper, we present a system which solves a Bar Examination written in natural language. The proposed system exploits recent techniques in Deep Neural Networks which have shown promise in many Natural Language Processing (NLP) applications. We evaluate our system on a real legal Bar Examination, the United States MultiState Bar Examination (MBE), which is a multi-choice, 200-question exam for aspiring lawyers. We show that our system achieves good performance without relying on any external knowledge. Our work comes with the added effort of curating a small corpus from the well-known MBE examination, following similar question answering datasets. The proposed system beats a TFIDF-based baseline, while showing strong performance when modified for a legal Textual Entailment evaluation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Many tasks in Natural Language Processing (NLP)
involve generation of semantic representation for proper text
understanding. For example, tasks like Textual Entailment
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Question Answering [
        <xref ref-type="bibr" rid="ref11">11, 31</xref>
        ] involve deep semantic
understanding of the text since a popular approach like the
Bag of Words (BOW) has limitations due to natural
language ambiguity.
      </p>
      <p>
        Question Answering (QA) tasks follow the Human
learning and testing process. For instance, a student reads a
course note in order to obtain some facts and background
knowledge. The student then answers any question based
on the facts available to him. This is the main essence of
learning, which is about 'committing to memory' and
'generalizing' to new events. Even though learning seems to
be a natural phenomenon to humans, it is nevertheless still
a challenging goal for computers to replicate. Researchers
working in the Computer Science field of Machine Learning
(ML) often employ methods to analyze existing data in
order to predict the likelihood of uncertain outcomes. These
methods usually produce results that approximate human
capabilities [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        The term ML is actually a broad term used to describe
supervised or unsupervised approaches for making the
computer identify patterns in our data. Usually, a human
handcrafts some features from the data, and the extracted
features are then shown to the algorithm for it to learn the
latent discriminating features. Finally, the algorithm learns
to predict the outcome of an unseen event. Neural Networks
(NN) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are now extensively used by researchers because
they offer a higher representational power. NNs try to mimic
the cognitive system of the human. They have a lot of
interconnected nodes. Each node receives some inputs from the
lower layer nodes, performs a computation on the input by
using some non-linear functions, and lastly, the node
transmits its output to the nodes in the layer above it. Such a
network with many interconnecting layers stacked is called
a Deep Neural Network (DNN) [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>When performed by a human, QA requires cognitive abilities such as reasoning, meta-cognition, the
contextual perception of abstract concepts, intelligence, and
language comprehension. Although machines are yet to
replicate human-level cognitive ability,
non-cognitive computational techniques that employ
heuristics and statistical approximation can model most
problems well, giving an 'intelligent' result which is close
to that from a human [27]. We build on this assumption
and do not compare the cognitive capability of our system
with that of a human. Instead, the goal is to achieve a result that
a human examiner would consider acceptable.</p>
      <p>In the QA task, systems are provided with a text passage
containing some facts or background knowledge, and a
question which is related to that text passage. Furthermore, an
answer to the question is provided. The system is then given
a similar but slightly different question and is expected to
answer it from the same background knowledge.</p>
      <p>The remaining part of the paper is organized as follows.
In the next section, we review the related work. This is
followed by a description of the MBE Exam and the corpus
used for the experiment. Next, we describe our approach.
Finally, we describe the experiment and evaluation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        NNs have shown good performance in many NLP tasks
including QA. The authors in [
        <xref ref-type="bibr" rid="ref12">31, 12</xref>
        ] achieved excellent
results with DNNs for QA. In particular, [31] achieved 100%
accuracy on some tasks.<sup>1</sup> Similarly, the work of [26] and the
Answer-Sentence Selection proposed by Feng [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are also
based on NNs. A considerable portion of the QA systems use
a synthetic dataset. For example, the dataset in [31] was
generated by simulating time-stepped facts using entity,
location, and temporal information.
The models in [
        <xref ref-type="bibr" rid="ref12">31, 12, 26</xref>
        ] were trained to memorize factual
information about the entities in a given story, e.g., keeping
track of the where, when, and who information regarding an
entity. Furthermore, the questions are quite simple. Each
question requires only a factoid answer. According to the
authors, it is expected that a question should be
unambiguous [
        <xref ref-type="bibr" rid="ref13">31, 13</xref>
        ]. Bordes et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] utilized a more challenging
dataset. Nevertheless, the questions still require factoid
answers. In particular, the dataset contains list questions, i.e.,
questions with multi-choice answers. The work in [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13,
31</xref>
        ] showcases an array of experiments aimed at
examining and estimating the text comprehension capability
of a QA system.
      </p>
      <p><sup>1</sup> E.g., the single-supporting-fact and two-supporting-facts tasks on the bAbI dataset. A similar result was reported for the CBT and Simple Question datasets. The datasets are accessible at https://research.facebook.com/research/babi/</p>
      <p>
        Some QA systems exploit external information, i.e., those
available in a knowledge base, a Semantic Net, or the
Internet, for generating a plausible answer to a question. For
instance, some researchers utilized a collection of facts which
have been extracted from a large text collection in the form of
Subject-Verb-Object (SVO) triples. The triples are then
stored in a knowledge base [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ]. The QA system is
therefore trained to map a question to the relevant fact in the
knowledge base. This often requires transcribing a question
into a format that can easily be matched to the fact in the
knowledge base. The problem with this approach is the
overreliance on a structured set of facts, e.g., (Donald Trump,
is-president-of, United States). Moreover, the SVO triples
may be difficult to curate, the triple extraction algorithms
may overgenerate, and the accuracy of SVO extraction may
not be optimal. Also, there is presently no domain-specific
collection of SVO fact triples for the Legal domain.
      </p>
      <p>
        A few QA systems address solving a real exam question.
The closest to our work in this regard is QANTA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] which
learns word and phrase-level representations with a
Recurrent Neural Network (RNN) for identifying an answer that
appears as an entity in the paragraph. The authors in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
presented a system for solving biology questions. Similarly
to QANTA, the paragraphs contain a description of a
biological process, a short question, and two choice answers out
of which only one is the correct answer. Weston et al. [32,
31] employed a Memory Network for the bAbI tasks.<sup>2</sup> The
bAbI tasks include the single-supporting-fact and
multiple-supporting-facts tasks. However, some of the supporting facts are
irrelevant to the answer. Also included are the yes/no
questions and list/set questions. The Memory Network follows
the Long Short-Term Memory (LSTM), which is a NN that
is capable of retaining information over a longer time-step
than a typical RNN. The MCTest challenge proposed by Yin
et al. [35] is also closely related to our work. The essential
differences are the nature of the data used, the long sequences
of the paragraphs, questions, and answers in our dataset,
as well as the format that the MBE exam question takes.
      </p>
      <p>However, there is limited prior work in the legal domain
in this respect. Most of the reviewed systems require a
factoid answer. Furthermore, the datasets are mostly synthetic,
i.e., not real examination questions and answers.
It is a popular saying that the 'Language of Law' does not
follow the 'Law of Language'. This is because, being domain
specific, legal texts employ legislative terms. For instance,
a sentence may reference another sentence (e.g., an article)
without any explicit link. Also, sentences are generally long
and often come with several clausal dependencies.
Moreover, there are usually several inter- and intra-sentential
anaphora that must be resolved. Wyner [33] lists
several NLP issues regarding the legal domain.</p>
      <p><sup>2</sup> The bAbI dataset is available at https://research.fb.com/projects/babi/</p>
      <p>
        The authors in [
        <xref ref-type="bibr" rid="ref17 ref18">18, 17</xref>
        ] employed a collection of legal text.
The dataset<sup>3</sup> was prepared from the Japanese Bar
Examination. The task was proposed as a Textual
Entailment (TE) task. The dataset consists of Japanese Civil
Code articles, some of which were used as the premise t,
and others the hypothesis h. The authors utilized a couple
of handcrafted features which are similar to the BOW
features usually employed for text similarity and IR. Similar
work was done in [29], where the authors mined reference
information from a collection of legal text.
      </p>
      <p>
        The most related work to ours is that of Biralatei et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which makes use of a real legal examination question
set. Specifically, the authors make use of the USA
MultiState Bar Examination (MBE). In their experiment, they
use 100 real multi-choice question-answer sets. Since each
question has 4 available answers out of which only one is
correct, they proposed a TE solution. By performing a
transformation on a question and each corresponding answer, they
obtained 400 t and h pairs, where t is the background
knowledge given as the text passage to a question, and h is a
transformed question-answer output. More explicitly, the
transformed question-answer output is a combination of a
question and a possible answer. Consequently, the authors
aimed to see if the transformed text is entailed by a passage.
Analogous to the work described in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the proposed TE
system profits heavily from some handcrafted features which
typify a similarity between t and h.
      </p>
      <p>However, handcrafting features is an expensive and
time-consuming process. It is easy to end up with noisy features, and a
series of ablation tests is required to identify the best ones.
Also, their approach relies on word similarity and synonym
substitution using existing knowledge resources like
WordNet and VerbOcean. The authors then compute a
BOW-based similarity feature between t and h. The problem with
this approach is that BOW-based approaches usually
suffer from language ambiguity.<sup>4</sup> Furthermore, the approach
assumes that a text passage will have a lot of word overlap
with the transformed h whenever there is an entailment. This
assumption is costly and may not always hold.
Moreover, some questions require extra knowledge apart from
what can be explicitly deduced from the given passage.
The following example illustrates this point.</p>
      <sec id="sec-2-1">
        <title>Example 2:</title>
        <p>Passage: A truck driver from State A and a bus driver from
State B were involved in a collision in State B that injured
the truck driver. The truck driver filed a federal diversity
action in State B based on negligence, seeking $100,000 in
damages from the bus driver.</p>
        <p>Question: What law of negligence should the court apply?</p>
        <p>Answer A (false): The court should apply the
federal common law of negligence.</p>
        <p>Answer B (false): The court should apply the
negligence law of State A, the truck driver's state of
citizenship.</p>
        <p>Answer C (false): The court should consider the
negligence law of both State A and State B and
apply the law that the court believes most appropriately
governs negligence in this action.</p>
        <p>Answer D (true): The court should determine which
state's negligence law a state court in State B would
apply and apply that law in this action.</p>
        <p><sup>3</sup> Released as part of the COLIEE Legal IR challenge. http://webdocs.cs.ualberta.ca/~miyoung2/COLIEE2016/</p>
        <p><sup>4</sup> E.g., synonymy, polysemy, etc.</p>
        <p>In example 2 above, the passage represents the context or
some knowledge needed for answering the question. Given
this example, an entailment-based system which focuses on
similarity would fail since answering the question requires
not just the word overlap but an understanding of the
semantics of the underlying texts.</p>
        <p>
          This work seeks to address this issue by proposing a NN
Legal Question Answering (LQA) system which employs an
LSTM to encode and decode the question-answer pair for a
good semantic representation. An LSTM is a type of RNN
with slightly more powerful language modeling capacity, and
it has become one of the most successful methods for
end-to-end supervised learning. Furthermore, LSTMs exhibit
a memory bank property since they are able to retain
information over many time-steps while also overcoming the
vanishing gradient problem [
          <xref ref-type="bibr" rid="ref14 ref3">14, 32, 3</xref>
          ].
        </p>
        <p>Our goal is to evaluate how well the proposed approach
performs on a legal text reasoning task, and whether the
performance of our model can compete with that of a human.
Generally, MBE examinees are required to answer correctly at
least 125 of the 200 standard MBE questions. The 125-point
benchmark is not absolute, however, since an examinee is
also required to obtain a certain number of points from the
essay exam. We assume that our model competes if it
obtains a score above the MBE nationwide mean score,
which is computed based on statistical analysis of past MBE
examinations. Table 1 shows the summary statistics of the
national performance for the year 2016.<sup>5</sup> The maximum
score obtained is 188/200, which is 94%. The
minimum is 58/200, which is 29%, and the mean score is
143/200, which is 71.5%. We also introduce
a new Legal QA corpus, specified in two formats which we
describe in the subsequent section, and thereby propose a
new form of Legal Question Answering task.</p>
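        <p>The percentages quoted in the preceding paragraph follow from dividing each raw score by the 200 questions of the exam. The short Python sketch below only illustrates that arithmetic; the variable names are ours, and the percentage for the 125-question benchmark is derived here, not quoted from the MBE statistics.</p>
        <preformat>
```python
TOTAL_QUESTIONS = 200

# Raw 2016 scores as quoted in the text (out of 200 questions).
scores = {"maximum": 188, "minimum": 58, "mean": 143}

# Convert each raw score to a percentage of the 200 questions.
percentages = {name: 100 * raw / TOTAL_QUESTIONS for name, raw in scores.items()}

# The commonly cited benchmark of 125 correct answers, as a percentage.
benchmark_percentage = 100 * 125 / TOTAL_QUESTIONS

for name, value in percentages.items():
    print(f"{name}: {value}%")
print(f"benchmark: {benchmark_percentage}%")
```
        </preformat>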
        <p>
          Many people from outside the ML field often regard NNs
as black boxes whose performance cannot be analyzed. To
address this concern, we benchmark our system against
a TFIDF baseline which predicts its outcome based on a
TFIDF similarity between the passage, question, and answer
in a way similar to the TE setting of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. By obtaining a
significantly better result than the baseline, we validate the
performance of our system.
        </p>
        <sec id="sec-2-1-1">
          <title>5http://www.ncbex.org/publications/statistics/mbe-statistics/</title>
          <p>Table 1. Summary statistics of the national MBE performance for the year 2016 (fields: Min Score, Max Score, Mean Score, Median Score, Standard Dev, No of Examinees).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. THE LQA CORPUS</title>
      <p>For a human to answer a question, he or she needs some
facts about the question. We can then generally make
deductions using the facts as well as some background knowledge
in order to provide a plausible answer. The question
answering task mimics this simple approach: background
knowledge from which to infer facts is provided, a question
is then given, and an examinee has to make a judgment
using these facts. Some questions are direct, such that the
expected answer is straightforward. E.g., someone who has
access to a book on current affairs can easily answer a
question like 'Who is the president of the USA?' (Donald Trump).
However, some questions require more than a set of facts to
be answered correctly. This type of
question requires logic in order to make a deduction from the
available facts. A typical example is the Bar examination.</p>
      <p>The MBE is a six-hour, 200-question multiple-choice
examination developed by the National Conference of Bar
Examiners (NCBE) and administered by the user jurisdiction
as part of the Bar Examination. The goal of the exam is to
assess the extent to which an examinee can apply
fundamental legal principles and legal reasoning in order to analyze
a given fact pattern.<sup>6</sup> The exam is very important, for it
is one of a number of measures that the NCBE may use in
determining an aspiring lawyer's competence to practice.
Each data point in the exam is a tuple, S = (P, Q, A<sub>1-4</sub>). Here,
P is the passage or background knowledge, Q is the question,
and A is the answer. Since it is a multi-choice exam, there
are four possible options in A, out of which only one is
correct and must be selected as the answer. The exam covers
a wide area of law including Constitutional Law, Contracts
Law, Criminal Law, Evidence, Real Property, Torts, and
Civil Procedure.</p>
      <p>
        Similar to the approach in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], for each A, we also split
S such that we have a separate representation for (P, Q, A<sub>i</sub>).
However, since our goal is not a Textual Entailment task, we
do not apply any transformation to the text to obtain a t-h pair,
as is the case in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In our case, each question-answer
sample S is represented as 4 mini samples, i.e., s<sub>1</sub>, s<sub>2</sub>, s<sub>3</sub>,
s<sub>4</sub>, such that each s is also a 4-tuple (P, Q, A<sub>i</sub>, F), where
P, Q, and A remain the same and F is a binary flag
identifying whether the answer is correct or not. In other
words, the goal is to determine if a specific answer is suitable
for a question, given some background knowledge. The task is
then formalized as an Answer-Sentence-Selection task.
      </p>
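      <p>The split of one exam item into four flagged mini samples can be sketched in Python as follows. This is an illustration under our own assumptions, not the authors' preprocessing code: the names MiniSample and expand_sample are hypothetical, and the passage and answer strings are placeholders.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MiniSample:
    """One (P, Q, A_i, F) tuple: passage, question, one answer option, binary flag."""
    passage: str
    question: str
    answer: str
    flag: int  # 1 if this option is the correct answer, else 0


def expand_sample(passage: str, question: str, answers: List[str],
                  correct_index: int) -> List[MiniSample]:
    """Split one item S = (P, Q, A_1..A_4) into four mini samples."""
    if len(answers) != 4:
        raise ValueError("an MBE item is expected to have exactly four answer options")
    return [
        MiniSample(passage, question, answer, int(i == correct_index))
        for i, answer in enumerate(answers)
    ]


samples = expand_sample(
    "A truck driver from State A ...",  # placeholder passage text
    "What law of negligence should the court apply?",
    ["option A", "option B", "option C", "option D"],  # placeholder options
    correct_index=3,
)
print([s.flag for s in samples])
```
      </preformat>
      <p>Under this representation, 550 exam items expand to 550 x 4 = 2200 flagged mini samples, which matches the corpus size reported for the first format.</p>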
      <sec id="sec-3-1">
        <title>Example 3:</title>
        <p>Passage: An entrepreneur from state A decided to sell hot
sauce to the public, labeling it 'Best Hot Sauce'. A
company incorporated in state B and headquartered in state
C sued the entrepreneur in federal court in state C. The
company sought $50,000 in damages and alleged that the
entrepreneur's use of the name 'Best Hot Sauce' infringed
the company's federal trademark. The entrepreneur filed an
answer denying the allegations, and the parties began
discovery. Six months later, the entrepreneur moved to dismiss
for lack of subject-matter jurisdiction.</p>
        <sec id="sec-3-1-1">
          <title>6http://www.ncbex.org/exams/mbe/</title>
          <p>Question: Should the court grant the entrepreneur's
motion?</p>
          <p>1. Answer A (True): No, because the complaint's claim
arises under federal law.</p>
          <p>Evidence: The claim asserts federal trademark
infringement, and therefore it arises under federal
law. Subject-matter jurisdiction is proper under
28 U.S.C. § 1331 as a general federal-question
action. That statute requires no minimum amount
in controversy, so the amount the company seeks
is irrelevant.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Label: 1</title>
        <p>2. Answer B (False): No, because the entrepreneur
waived the right to challenge subject-matter
jurisdiction by not raising the issue initially by motion or in
the answer.</p>
        <p>Evidence: Under Federal Rule 12(h)(3),
subject-matter jurisdiction cannot be waived and the court
can determine at any time that it lacks
subject-matter jurisdiction. Therefore, the fact that the
entrepreneur delayed six months before raising
the lack of subject-matter jurisdiction is
immaterial and the court will not deny his motion on
that basis.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Label: 0</title>
        <p>3. Answer C (False): Yes, because although the claim
arises under federal law, the amount in controversy is
not satisfied.</p>
        <p>Evidence: There is no amount-in-controversy
requirement for actions that arise under federal law.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Label: 0</title>
        <p>4. Answer D (False): Yes, because although there is
diversity, the amount in controversy is not satisfied.</p>
        <p>Evidence: Federal Rule 4(e)(2) governs service
on individual defendants and authorizes service
on a person of 'suitable age and discretion' only
when service is made at the defendant's dwelling
or usual place of abode, not at the defendant's
workplace.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Label: 0</title>
        <p>Example 2 shows a sample passage and the corresponding
question and answers. We can see that the option labeled
as 'True' is the only correct answer.</p>
        <p>The second format takes a similar style. However, we
introduce extra knowledge in the form of an explanation
made by an expert to justify why an answer is correct
or not. Each sample is thus a 5-tuple (P, Q, A<sub>i</sub>, E, F),
where P, Q, A, and F remain the same and E is the
extra knowledge which justifies F. We say that E is the
evidence since it justifies or explains why an answer is said to
be correct or incorrect. Example 3 shows the passage,
question, and answer along with the evidence which explains why the
answer is correct or wrong. The goal is to make the system
take advantage of the extra knowledge, since many answers
cannot be directly inferred from the passage without
extra information. It can be seen in Example 3 that there is
no clear linguistic overlap between the passage
text and the answer text. Also, the passage text contains
little or no information required for answering the question. In
this scenario, extra information (evidence) may indeed
be helpful for answering the question.</p>
        <p>For the purpose of the LQA corpus, we use a random
sample of 550 out of the 600 available passage-question-answer
sets from the 1991 MBE-I, 1999 MBE-II, 1998 MBE-III, and
some exam practice samples obtained from the examiner.<sup>7</sup>
We choose to use the exam questions because they are
publicly available and have a gold standard answer. We
prepared the question set in the (P, Q, A<sub>i</sub>, F) format explained
earlier, yielding 2200 passage-question-answer-flag tuples.<sup>8</sup> For the
second format with extra knowledge E, we obtained 15
annotated passage-question texts. In total, we obtained a set
of 60 question-answer sets in (P, Q, A<sub>i</sub>, E, F) format.
Because the number is quite small, we are working towards
getting annotations for more samples. We rely on the
validity and correctness of the gold standard and annotations
obtained from our sources.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. NEURAL REASONING OVER LQA</title>
      <p>
        Recently, NN algorithms such as the RNN [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and LSTM
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have excelled at language modeling tasks. LSTM, a
variant of RNN, is especially powerful since it is robust to
the vanishing gradient problem and has a memory that is
controlled by the input gate, the forget gate, and the output
gate. The LSTM is, therefore, able to retain information over
several time steps, i.e., a long sequence of words.
      </p>
      <p>
        LSTMs have been deeply studied [
        <xref ref-type="bibr" rid="ref14">14, 28</xref>
        ] and have
variants like the Memory Networks [32, 31], which are specifically
wired to retain information over longer sequences. An LSTM
network learns short and long-range contextual information.
      </p>
      <p>
        At each time step t, let an LSTM unit be a collection of
vectors in R<sup>d</sup>, where d is the memory dimension: an input
gate i<sub>t</sub>, a forget gate f<sub>t</sub>, an output gate o<sub>t</sub>, a memory cell c<sub>t</sub>,
and a hidden state h<sub>t</sub>. The u<sub>t</sub> is a tanh layer that applies
a non-linear function to the received input and creates a
vector of new candidate values that could be added to the
state. The state of any gate can either be open or closed,
represented as [0,1]. The LSTM transition is represented by
the following equations, where x<sub>t</sub> is the input vector at time step
t, σ represents the sigmoid activation function, and ⊙ is the
element-wise multiplication:
        <disp-formula>
          <tex-math>
\begin{gathered}
i_t = \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right), \\
f_t = \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right), \\
o_t = \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right), \\
u_t = \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right), \\
c_t = i_t \odot u_t + f_t \odot c_{t-1}, \\
h_t = o_t \odot \tanh(c_t)
\end{gathered}
          </tex-math>
        </disp-formula>
      </p>
      <sec id="sec-4-1">
        <title>7http://www.ncbex.org/exams/mbe/</title>
        <p>8 Our corpus is available on request.</p>
      </sec>
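      <p>The LSTM transition described in this section can be sketched in plain Python as follows. This is a didactic sketch and not the authors' implementation: the parameter layout (a dictionary mapping each gate name to a (W, U, b) triple), the helper names, and the toy dimensions are our assumptions, and the memory-cell and hidden-state updates follow the standard LSTM formulation.</p>
      <preformat>
```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b with plain Python lists."""
    return [
        sum(w * xj for w, xj in zip(w_row, x))
        + sum(u * hj for u, hj in zip(u_row, h))
        + b_k
        for w_row, u_row, b_k in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM transition; params maps 'i', 'f', 'o', 'u' to a (W, U, b) triple."""
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]    # input gate
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]    # forget gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]    # output gate
    u_t = [math.tanh(z) for z in affine(*params["u"], x_t, h_prev)]  # candidate values
    # Memory cell: gated mix of the new candidates and the previous cell state.
    c_t = [i * u + f * c for i, u, f, c in zip(i_t, u_t, f_t, c_prev)]
    # Hidden state: output gate applied to the squashed memory cell.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n = 4, 3  # memory dimension and input dimension (illustrative values)
params = {g: (random_matrix(d, n, rng), random_matrix(d, d, rng), [0.0] * d) for g in "ifou"}
h, c = [0.0] * d, [0.0] * d
for _ in range(5):  # a toy sequence of five random input vectors
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    h, c = lstm_step(x, h, c, params)
print(len(h), len(c))
```
      </preformat>
      <p>Because the forget gate multiplies the previous memory cell element-wise, information can be carried across many steps whenever that gate stays close to 1, which is the retention property discussed in this section.</p>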
    </sec>
    <sec id="sec-5">
      <title>5. METHODS</title>
      <p>We describe the general framework of our model in this
section. Given a set of inputs, the goal is to find an input
representation that encodes the passage P, the
question Q, and the answer A. Our model is essentially a
distributional sentence model which is able to comprehend the
semantics of the input texts. Our model has three key
components, i.e., the encoder module, the interaction module,
and the output module.</p>
    </sec>
    <sec id="sec-6">
      <title>5.1 Input Encoder</title>
      <p>
        At the input layer, we introduce three bi-directional LSTM
(BiLSTM) encoders that read the sequences of P, Q, and
A separately. A BiLSTM is essentially composed of two
LSTMs: one captures information in one direction, from
the first time-step to the last, while the other
captures information from the last time-step to the first.
The outputs of the two LSTMs are then combined to
obtain a final representation. Here, we represent each word
in the sentences P, Q, and A with a d-dimensional vector,
where the vectors are obtained from a word embedding
matrix. Generally, we use the GloVe 300-dimensional vectors
obtained after training the GloVe algorithm on 840 billion
words [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. In practice, a domain-specific embedding can be
learned from a collection of legal texts by using an algorithm
like Word2Vec [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. However, our dataset is too small for
any useful embeddings to be generated with the Word2Vec
algorithm. While building the vocabulary, any citation of
a Law article, e.g., 28 U.S.C. § 1331, or any date or amount of money, e.g.,
$50,000, in a text is represented by a special symbol. Also,
entities such as State A, State B, or State C are
automatically identified and given a special symbol. Each special
symbol in the vocabulary is associated with a randomly
initialized vector in the embedding matrix. We encode and
obtain the sentence representation of each input text using
equation (2) such that a vector representation that captures
the meaning of each text is learned:
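        <disp-formula>
          <label>(2)</label>
          <tex-math>
\begin{gathered}
\overrightarrow{h_i} = \mathrm{LSTM}(\overrightarrow{h}_{i-1}, P_i), \quad i \in [1, \ldots, M] \\
\overleftarrow{h_i} = \mathrm{LSTM}(\overleftarrow{h}_{i-1}, P_i), \quad i \in [M, \ldots, 1] \\
\mathrm{BiLSTM}(P) = [\overrightarrow{h_i}; \overleftarrow{h_i}] \\
h_p = \mathrm{BiLSTM}(P)
\end{gathered}
          </tex-math>
        </disp-formula>
      </p>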
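      <p>The special-symbol step used while building the vocabulary in this section can be illustrated with a few regular expressions. This is only a sketch: the symbol names (CITATION_SYM, MONEY_SYM, STATE_X_SYM) and the patterns are assumptions made for illustration, the paper does not specify them, and dates are left out for brevity.</p>
      <preformat>
```python
import re

# Hypothetical symbol names and patterns; the paper does not specify them.
PATTERNS = [
    # Statute citations such as "28 U.S.C. § 1331".
    (re.compile(r"\b\d+\s+U\.S\.C\.\s*§\s*\d+"), "CITATION_SYM"),
    # Dollar amounts such as "$50,000" or "$1,250.50".
    (re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"), "MONEY_SYM"),
    # Anonymized state entities such as "State A" or "state C".
    (re.compile(r"\b[Ss]tate\s+([A-Z])\b"), r"STATE_\1_SYM"),
]


def normalize(text: str) -> str:
    """Replace citations, money amounts, and state entities with special symbols."""
    for pattern, symbol in PATTERNS:
        text = pattern.sub(symbol, text)
    return text


print(normalize("The company in State B sought $50,000 under 28 U.S.C. § 1331."))
```
      </preformat>
      <p>Each resulting symbol would then receive its own randomly initialized row in the embedding matrix, while ordinary words keep their pretrained vectors.</p>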
    </sec>
    <sec id="sec-7">
      <title>5.2 Interaction Layer</title>
      <p>
        The interaction layer is formalized as a hierarchical
attention layer for reducing the input space from three to two.
Attention is a way of focusing on some important parts of an
input, and has been used extensively in language
modeling tasks such as machine translation, natural language
inference, and document classification [
        <xref ref-type="bibr" rid="ref1 ref22">1, 22, 34</xref>
        ]. Essentially,
it is able to identify the parts of a text that are most
important to the overall meaning. We use two forms of attention,
namely inter and intra attention. The intra attention focuses
on the important words within the same text. Specifically,
such important words can be aggregated to compose the
meaning of the text. The implication of this is that we can
use the intra attention to focus on important words
independently for each of the P, Q, and A texts. On the other hand, the
inter attention tries to attend to the important words in one
text conditioned on the intra-attention weighted
representation of the second text. In other words, the inter attention
allows for an interaction between two texts and ensures that
we focus on words that are most important for representing
the meaning of one text in the context of the other text.
      </p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we use intra-attention to obtain the
sentence representation as shown in equation (3). Initially, the
encoded sentence (see equation (2)) is passed through
a Multi-Layer Perceptron (MLP) Neural Network to get a
hidden representation u<sub>i</sub>, which is then weighted with the
attention vector α<sub>i</sub> across the time steps. The attention vector
α<sub>i</sub> is implemented as a Softmax whose weights sum up to
1, and are used to compute a weighted average of the last
hidden layers generated after processing each of the input
words.
        <disp-formula>
          <label>(3)</label>
          <tex-math>
\begin{gathered}
u_i = \tanh(W_p h_p + b_p) \\
\alpha_i = \frac{\exp(u_i^{M} u_p)}{\sum_{p} \exp(u_i^{M} u_p)} \\
h_s = \sum_{p} \alpha_i h_i
\end{gathered}
          </tex-math>
        </disp-formula>
Here, i signifies each time step in h<sub>p</sub>, h<sub>p</sub> is the encoded text,
and M is the number of time-steps in h<sub>p</sub>. The vector u<sub>p</sub> is
a context vector which may be randomly initialized.
      </p>
      <p>
        The inter attention follows a similar approach. In
particular, we use it to capture the interaction between the
sentences using equation (4). Specifically, this means
that we can use one inter attention layer to obtain the
interaction between the intra-attention hidden states of the
encoded passage text and that of the encoded question text
(P → Q). Also, the same attention layer is employed to
capture the interaction between the encoded question text
and the encoded answer text (Q → A). Each of the interactions
generated with the inter-attention produces a high-level
representation of these texts which can now be used for
classification. Put another way, we obtain two vectors which
summarize the interaction between the input sentences.
        <disp-formula>
          <label>(4)</label>
          <tex-math>
\begin{gathered}
u_s = \tanh(W_s h_s + b_s) \\
\alpha_s = \frac{\exp(u_s^{N} u_q)}{\sum_{q} \exp(u_s^{N} u_q)} \\
s = \sum_{q} \alpha_s h_s
\end{gathered}
          </tex-math>
        </disp-formula>
      </p>
    </sec>
    <sec id="sec-8">
      <title>5.3 Output layer</title>
      <p>
        The task can be simplified as a binary classification task
since an answer either has a label 0 or 1. The two
vectors s<sub>p</sub> and s<sub>q</sub> are the resulting representations, which can
be regarded as the high-level representation of the
interaction between the texts P, Q, and A. In supervised learning, when
there is a sufficient number of positive and negative
samples for a category of example, we can formalize the task
as a ranking task, trying to create a margin between the
positive and negative examples, and ranking based on the
margin. There are different approaches to the Learning to
Rank task, e.g., Pointwise, Pairwise, and Listwise [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A
Pointwise ranking is straightforward and involves training a
binary classifier, i.e., given a triple of question q, answer a,
and a label y, as (q<sub>i</sub>, a<sub>ij</sub>, y<sub>ij</sub>), the ranking function is given
as h(w, (qi, aij )) ) yij . Here, the function creates a
feature vector from the question and answer sample. Also,
w is a vector of model weights.
      </p>
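      <p>The attention pooling that produces the two vectors used by the classifier can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: it assumes that each time step is scored by a dot product between its projected hidden state and a context vector, in the style of hierarchical attention networks [34], and it assumes that the inter attention can reuse the same routine with the pooled vector of the other text as context. The function names and all numeric values are hypothetical.</p>
      <preformat>
```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, weight, bias, context):
    """Collapse a sequence of hidden states into a single vector.

    hidden_states: one vector per time step (the encoded text)
    weight, bias:  parameters of the one-layer MLP
    context:       context vector that scores each time step
    Returns (pooled_vector, attention_weights).
    """
    # u_i = tanh(W h_i + b) for every time step i
    projected = [
        [math.tanh(dot(row, h) + b) for row, b in zip(weight, bias)]
        for h in hidden_states
    ]
    # Softmax over time steps; the dot-product score is an assumption.
    alphas = softmax([dot(u, context) for u in projected])
    # Attention-weighted sum of the original hidden states
    size = len(hidden_states[0])
    pooled = [
        sum(a * h[k] for a, h in zip(alphas, hidden_states))
        for k in range(size)
    ]
    return pooled, alphas


# Toy 2-dimensional "encodings"; all numbers are made up.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
question = [[0.2, 0.9], [0.8, 0.1]]
passage = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Intra attention over the question with a fixed context vector.
s_question, _ = attention_pool(question, identity, zeros, [0.5, 0.5])
# Inter attention over the passage, conditioned on the pooled question.
s_passage, weights = attention_pool(passage, identity, zeros, s_question)
print(weights)
```
      </preformat>
      <p>In a trained model the MLP parameters and the context vector would be learned jointly with the encoder; here they are fixed only to keep the sketch self-contained.</p>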
      <p>
        In order to implement our binary classifier, we concatenate the vectors s<sub>p</sub> and s<sub>q</sub> (see equation (5)) and then propagate the output of the concatenation to an MLP where the interaction is fully modeled. Finally, a Softmax layer is used to distribute the probability over the labels.
        <disp-formula>
          <label>(5)</label>
          <tex-math>s_{concat} = [s_p; s_q]</tex-math>
        </disp-formula>
      </p>
      <p>
        Formally, we denote l<sub>i</sub>, i = 1, 2, 3, ..., N-1 as the intermediate hidden layers, W<sub>i</sub> as the i-th weight matrix, and b<sub>i</sub> as the i-th bias term. The hidden layer computation of the MLP can be represented as follows:
        <disp-formula>
          <label>(6)</label>
          <tex-math>
\begin{gathered}
l_1 = W_1 s_{concat} \\
l_i = f(W_i l_{i-1} + b_i), \quad i = 2, 3, \ldots, N-1 \\
y_o = f(W_N l_{N-1} + b_N)
\end{gathered}
          </tex-math>
        </disp-formula>
        where y<sub>o</sub> is the output vector of the last layer, f is a non-linear function which, in this work, is the hyperbolic tangent (tanh) activation function, and N represents the number of layers in our neural network. The predicted class is obtained by passing the output vector y<sub>o</sub> through a Softmax layer as shown in equation (7).
        <disp-formula>
          <label>(7)</label>
          <tex-math>\hat{y} = \mathrm{Softmax}(W_c y_o + b_c)</tex-math>
        </disp-formula>
        where y<sub>o</sub> is the output vector from the outermost tanh layer, W<sub>c</sub> and b<sub>c</sub> are the weight matrix and bias vector to be learned by the network, and Softmax is a non-linear activation function that distributes the class probabilities as shown in equation (8). ŷ is the predicted class.
        <disp-formula>
          <label>(8)</label>
          <tex-math>\Pr(\hat{y} = c \mid y) = \frac{e^{y^{\top}\theta_c}}{\sum_{k=1}^{K} e^{y^{\top}\theta_k}}</tex-math>
        </disp-formula>
        where θ<sub>k</sub> is the weight vector of the k-th class.
      </p>
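      <p>The forward pass of the output layer can be illustrated with a short plain-Python sketch. It follows the structure described above (concatenation, a linear first layer, tanh hidden layers, and a Softmax over two labels), but the function names, layer sizes, and weights are hypothetical and chosen only so that the example runs without any external library.</p>
      <preformat>
```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(s_p, s_q, w_first, hidden, w_c, b_c):
    """Return class probabilities for one (passage, question, answer) sample.

    w_first: weight matrix of the first (linear) layer
    hidden:  list of (weight, bias) pairs for the tanh layers
    w_c,b_c: parameters of the final Softmax layer
    """
    s_concat = s_p + s_q                    # concatenation of the two vectors
    layer = matvec(w_first, s_concat)       # first layer: linear map only
    for weight, bias in hidden:             # remaining layers use tanh
        layer = [
            math.tanh(z + b) for z, b in zip(matvec(weight, layer), bias)
        ]
    logits = [z + b for z, b in zip(matvec(w_c, layer), b_c)]
    return softmax(logits)


# Hypothetical 1-dimensional interaction vectors and identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = classify(
    s_p=[1.0],
    s_q=[-1.0],
    w_first=identity,
    hidden=[(identity, [0.0, 0.0])],
    w_c=identity,
    b_c=[0.0, 0.0],
)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted)
```
      </preformat>
      <p>The predicted label is the index of the largest probability; with two classes this reduces to the binary correct/incorrect decision described in this section.</p>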
    </sec>
    <sec id="sec-9">
      <title>6. SYSTEM EVALUATION</title>
      <p>We now describe the experiment and the results obtained. Recall that the goal of our model is to identify whether an answer is correct given a question and a corresponding passage. This is different from the TE task, which seeks to establish whether the hypothesis can be inferred from the premise.</p>
    </sec>
    <sec id="sec-10">
      <title>6.1 Training Parameter</title>
      <sec id="sec-10-1">
        <title>System Description</title>
        <p>TFIDF-based</p>
        <p>Our Model</p>
        <p>
          We implemented our model inspired by the work in [26, 32]. As mentioned earlier, instead of encoding the passage, question and answer text sequences as one-hot representations of the token sequences, we used the pretrained 300-dimensional GloVe vectors [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. We keep the embedding weights fixed throughout the training. The embedding vectors are obtained from an algorithm which is based on the distributional hypothesis [30]. The algorithm operates such that, when given the contexts of a word, it is able to predict words that may appear close to that word. It turns out that it captures many semantic characteristics of a text, such as similarity and relatedness, and it has been widely applied in numerous NLP tasks. We use the Keras<sup>9</sup> Deep Learning library to prototype our model. The data is split 80:10:10 into training, validation, and test sets respectively. We uniformly use a dropout of 0.20, a batch size of 8, the ADAM optimizer and a learning rate of 0.01. The model was trained for 20 epochs. Even though we already apply dropout [25] throughout the model, we also use early stopping to avoid overfitting, usually stopping the training after 4 consecutive epochs without any drop in the validation loss. The model used for testing is the best one obtained on the validation set. We found that our best model is achieved by epoch 10; if we continue to train beyond that point, we keep getting very high accuracy on the training data which does not generalize to the validation and test sets.
        </p>
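        <p>The early-stopping rule described above can be illustrated independently of any deep learning library. The sketch below is not the authors' training code: the hyperparameter values are the ones reported in the text, while the dictionary keys, the function name, and the sequence of validation losses are made up for illustration.</p>
        <preformat>
```python
# Hyperparameters reported in the text; the dictionary keys are illustrative.
CONFIG = {
    "dropout": 0.20,
    "batch_size": 8,
    "learning_rate": 0.01,
    "max_epochs": 20,
    "patience": 4,
}


def run_early_stopping(val_losses, patience, max_epochs):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    val_losses holds one validation loss per epoch. Training stops once the
    loss has failed to improve for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    stale_epochs = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        stop_epoch = epoch
        if best_loss > loss:
            best_loss, best_epoch, stale_epochs = loss, epoch, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    return best_epoch, stop_epoch


# Made-up validation losses that bottom out at epoch 10.
losses = [0.90, 0.80, 0.72, 0.66, 0.61, 0.58, 0.56, 0.55, 0.545, 0.54,
          0.56, 0.58, 0.60, 0.63, 0.66, 0.70, 0.74, 0.78, 0.82, 0.86]
best, stopped = run_early_stopping(
    losses, CONFIG["patience"], CONFIG["max_epochs"]
)
print(best, stopped)  # prints: 10 14
```
        </preformat>
        <p>With these illustrative losses, the weights saved at epoch 10 would be the ones used for testing, and training would halt at epoch 14 instead of running the full 20 epochs.</p>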
      </sec>
    </sec>
    <sec id="sec-11">
      <title>6.2 Experiment</title>
      <p>We compare the results of our model against a TFIDF baseline. The baseline rests on a simple assumption: if we consider the TFIDF scores of the passage text (i.e., P) on one hand, and those of the question and answer texts (i.e., Q + A) on the other hand, a high similarity between the two indicates that the answer is relevant to the question. The TFIDF feature of (Q + A) is subtracted from the TFIDF feature of (P), and the resulting vector is passed through an MLP along with the label, i.e., a simple MLP classification approach. The assumption is naive; however, we consider it an adequate baseline. Specifically, we would like to know whether our model is capturing only word overlap features or actual logic in the form of the semantics of a text. Intuitively, we expect the TFIDF-based model to capture overlap features. However, a good system must demonstrate that it captures not just the word overlap features but also the semantics and other legal nuances in a text. Table 2 shows the results of the experiment, comparing the performance of our model to that of the TFIDF-based predictor. Table 3 shows a comparison of our model to the student performance in the MBE exam in the year 2016.</p>
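      <p>The feature used by the TFIDF baseline can be sketched as follows. The text does not state the exact TFIDF weighting or tokenization, so the smoothed inverse document frequency, the whitespace-style token lists, and every function name below are assumptions made for illustration; the resulting difference vector is what would be fed, together with the label, to the MLP classifier.</p>
      <preformat>
```python
import math
from collections import Counter


def fit_idf(corpus):
    """Compute a smoothed inverse document frequency for each token.

    corpus is a list of tokenized documents. The smoothing is an assumption.
    """
    n_docs = len(corpus)
    doc_freq = Counter(tok for doc in corpus for tok in set(doc))
    return {
        tok: math.log((1 + n_docs) / (1 + freq)) + 1.0
        for tok, freq in doc_freq.items()
    }


def tfidf(doc, idf, vocab):
    """TFIDF vector of one non-empty tokenized document over a fixed vocabulary."""
    counts = Counter(doc)
    return [counts[tok] / len(doc) * idf[tok] for tok in vocab]


def baseline_feature(passage, question, answer, idf, vocab):
    """TFIDF(P) minus TFIDF(Q + A), the input to the baseline classifier."""
    p_vec = tfidf(passage, idf, vocab)
    qa_vec = tfidf(question + answer, idf, vocab)
    return [p - q for p, q in zip(p_vec, qa_vec)]


# Tiny made-up corpus used only to fit the vocabulary and idf weights.
corpus = [
    ["court", "granted", "motion"],
    ["woman", "appealed"],
    ["court", "ruling"],
]
idf = fit_idf(corpus)
vocab = sorted(idf)
feature = baseline_feature(
    ["court", "granted", "motion"], ["court", "ruling"], ["appealed"],
    idf, vocab,
)
print(dict(zip(vocab, feature)))
```
      </preformat>
      <p>Entries near zero correspond to terms that the passage and the question-answer text share in similar proportion, which is why such a feature mainly reflects word overlap.</p>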
      <sec id="sec-11-1">
        <title>9 https://github.com/fchollet/keras</title>
        <p>
          In order to allow for comparison with a few legal TE systems, we modify our model such that the input space is reduced to two, i.e., similar to a premise and a hypothesis. It is also possible to modify the text from our dataset. Normally, we could join the question text to its corresponding passage text, and regard it as the premise. We could also manually rewrite the answer text where possible by including some phrases from the question text, such that the text reads sensibly. In that case, we can regard the resulting text as the hypothesis. This would make the dataset preparation step similar to the one described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. However, because we do not have the dataset of Fawei et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], it is difficult to perform any direct comparison, even though their work is similar to ours in terms of the domain and data. Instead, we utilized the Japanese civil codes dataset which was released in the context of COLIEE 2014. This dataset has evolved over the years, and an increasing number of researchers are evaluating their work on it.
        </p>
        <p>
          We encode the input texts following the description given in section 5.1. However, we induce interaction between the input texts at only one level: we perform only the intra-sentence attention, without any need for the inter-sentence attention. Apart from this modification, every other part of the model remains intact. Table 4 shows the result of our system against three other systems when evaluated on the COLIEE dataset in the context of Textual Entailment. The first and the third are the baseline systems, i.e., the results reported by the authors in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. The second is a participant in the COLIEE task [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. We can see that our model slightly outperforms the reported systems.
        </p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>6.3 Discussion</title>
      <p>Table 2 shows the result obtained on the LQA corpus in the main evaluation. We see that our model significantly outperforms the TFIDF baseline. Throughout the evaluation, we use the standard accuracy metric. To validate our model, we inspected the questions that were scored correctly by our model but incorrectly by the TFIDF baseline. We give one example of such a passage-question pair. In this particular example, the TFIDF baseline predicted the wrong label for each of the answer options.</p>
      <sec id="sec-12-1">
        <title>Example 4:</title>
        <p>Passage: After being fired, a woman sued her former employer in federal court, alleging that her supervisor had discriminated against her on the basis of her sex. The woman's complaint included a lengthy description of what the supervisor had said and done over the years, quoting his telephone calls and emails to her and her own emails to the supervisor's manager asking for help. The employer moved for summary judgment, alleging that the woman was a pathological liar who had filed the action and included fictitious documents in revenge for having been fired. Because the woman's attorney was at a lengthy out-of-state trial when the summary-judgment motion was filed, he failed to respond to it. The court, therefore, granted the motion in a one-line order and entered final judgment. The woman has appealed.</p>
        <p>Question: Is the appellate court likely to uphold the trial court's ruling?</p>
        <p>Answer A (false): No, because the complaint's allegations were detailed and specific.</p>
        <p>Answer B (true): No, because the employer moved for summary judgment on the basis that the woman was not credible, creating a factual dispute.</p>
        <p>Answer C (false): Yes, because the woman's failure to respond to the summary-judgment motion means that there was no sworn affidavit to support her allegations and supporting documents.</p>
        <p>Answer D (false): Yes, because the woman's failure to respond to the summary-judgment motion was a default giving sufficient basis to grant the motion.</p>
        <p>We can see that predicting a correct answer for this particular example requires a semantic understanding of the underlying text. We conclude that this is evidently lacking in the TFIDF baseline.</p>
        <p>Table 3 compares the result of our model with the overall performance of students in the 2016 NCBE statistics. We arrive at the percentage score based on the data in Table 1: each score is divided by the total possible score (200) and then multiplied by 100. We can see that our model significantly outperforms the minimum student score, and that it also obtains a better score than the mean student score. This suggests that the model captures an appreciable approximation of the legal technical jargon. We expect improved performance once we have a sizable legal text collection, which we can use to train the Word2Vec algorithm to obtain the embedding matrix for our vocabulary words. It is even better if such texts are related to the MBE exam, since this will produce semantically rich embeddings that capture many legal terms. In addition, using extra facts, e.g., as proposed in the second format of the corpus, should improve the performance since many extra details for general learning would be captured.</p>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>7. CONCLUSION</title>
      <p>In this paper, we presented a Legal Question Answering system using a Deep Neural Network technique. Specifically, we employed an LSTM Neural Network, which can retain information much longer than a conventional Recurrent Neural Network. We also described a corpus which has been extracted from the USA MBE exams. We formalize the task as that of Answer-Sentence-Selection, where the system selects the correct answer to a question given a background passage. When compared against a TFIDF baseline, our model displayed a significantly better performance. Similarly, when compared against human performance, based on the available statistics on student performance in the MBE exam, the system obtained a better performance than the mean student score. The proposed task is different from the Textual Entailment task; nevertheless, the system shows a good result on a textual entailment dataset. In the future, we would like to obtain more data from legal tests like the MBE or equivalent exams in other countries. We provided a dataset with more information that explains why an answer is correct or otherwise. Intuitively, ML algorithms may learn from the extra information to guide their choice of answer. However, this part is currently lacking in our work. In our future work, we would like to explore how we can improve the performance of our system by incorporating this evidential information as described in section 3. In particular, it would be interesting to compare ML models that take advantage of this information to those that have no access to it.</p>
    </sec>
    <sec id="sec-14">
      <title>ACKNOWLEDGEMENT</title>
      <p>Kolawole J. Adebayo has received funding from the Erasmus Mundus Joint International Doctoral (Ph.D.) programme in Law, Science and Technology. Luigi Di Caro and Guido Boella have received funding from the European Union's H2020 research and innovation programme under the grant agreement No 690974 for the project "MIREL: MIning and REasoning with Legal texts". The authors would like to thank the anonymous reviewers who have suggested ways to improve the quality of the paper.</p>
      <p>[25] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.</p>
      <p>[26] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015.</p>
      <p>[27] Harry Surden. Machine learning and law. 2014.</p>
      <p>[28] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.</p>
      <p>[29] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. Answering legal questions by mining reference information. In JSAI International Symposium on Artificial Intelligence, pages 214–229. Springer, 2013.</p>
      <p>[30] Peter D Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188, 2010.</p>
      <p>[31] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merrienboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.</p>
      <p>[32] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.</p>
      <p>[33] Adam Wyner and Wim Peters. On rule extraction from regulations. In JURIX, volume 11, pages 113–122. Citeseer, 2011.</p>
      <p>[34] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT, pages 1480–1489, 2016.</p>
      <p>[35] Wenpeng Yin, Sebastian Ebert, and Hinrich Schutze. Attention-based convolutional neural network for machine comprehension. arXiv preprint arXiv:1602.04341, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio.</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Berant</surname>
          </string-name>
          , Vivek Srikumar,
          <string-name>
            <surname>Pei-Chun</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Abby Vander Linden, Brittany Harding, Brad Huang,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Modeling biological processes for reading comprehension</article-title>
          .
          <source>In EMNLP</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Antoine</given-names>
            <surname>Bordes</surname>
          </string-name>
          , Nicolas Usunier, Sumit Chopra, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <article-title>Large-scale simple question answering with memory networks</article-title>
          .
          <source>arXiv preprint arXiv:1506.02075</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Cao</surname>
          </string-name>
          , Tao Qin,
          <string-name>
            <surname>Tie-Yan</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Feng Tsai</surname>
            , and
            <given-names>Hang</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Learning to rank: from pairwise approach to listwise approach</article-title>
          .
          <source>In Proceedings of the 24th international conference on Machine learning</source>
          , pages
          <fpage>129</fpage>
          –
          <lpage>136</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ido</given-names>
            <surname>Dagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Oren</given-names>
            <surname>Glickman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bernardo</given-names>
            <surname>Magnini</surname>
          </string-name>
          .
          <article-title>The pascal recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment</article-title>
          , pages
          <fpage>177</fpage>
          –
          <lpage>190</lpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Fader</surname>
          </string-name>
          , Janara Christensen, Stephen Soderland, and
          <string-name>
            <surname>Mausam</surname>
            <given-names>Mausam</given-names>
          </string-name>
          .
          <article-title>Open information extraction: The second generation</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <volume>11</volume>
          , pages
          <fpage>3</fpage>
          –
          <lpage>10</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Anthony</given-names>
            <surname>Fader</surname>
          </string-name>
          , Stephen Soderland, and
          <string-name>
            <given-names>Oren</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Identifying relations for open information extraction</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1535</fpage>
          –
          <lpage>1545</lpage>
          . Association for Computational Linguistics,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Laurene</surname>
            <given-names>V</given-names>
          </string-name>
          <string-name>
            <surname>Fausett</surname>
          </string-name>
          .
          <article-title>Fundamentals of neural networks</article-title>
          .
          <source>Prentice-Hall</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Biralatei</given-names>
            <surname>Fawei</surname>
          </string-name>
          , Adam
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wyner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeff Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          .
          <article-title>Passing a USA national bar exam - a first experiment</article-title>
          .
          <source>In Legal Knowledge and Information Systems - JURIX</source>
          <year>2015</year>
          :
          <article-title>The Twenty-Eighth Annual Conference</article-title>
          , Braga, Portugal,
          <source>December 10-11</source>
          ,
          <year>2015</year>
          , pages
          <fpage>179</fpage>
          –
          <lpage>180</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Minwei</surname>
            <given-names>Feng</given-names>
          </string-name>
          , Bing Xiang, Michael R Glass,
          <string-name>
            <given-names>Lidan</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bowen</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Applying deep learning to answer selection: A study and an open task</article-title>
          .
          <source>In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , pages
          <fpage>813</fpage>
          –
          <lpage>820</lpage>
          . IEEE,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jianfeng</surname>
            <given-names>Gao</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Gamon</surname>
          </string-name>
          , Xiaodong He, and
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Pantel</surname>
          </string-name>
          .
          <article-title>Modeling interestingness with deep neural networks</article-title>
          ,
          <source>June</source>
          <volume>13</volume>
          2014. US Patent App.
          <volume>14</volume>
          /304,
          <fpage>863</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Karl</given-names>
            <surname>Moritz</surname>
          </string-name>
          <string-name>
            <surname>Hermann</surname>
          </string-name>
          , Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>Teaching machines to read and comprehend</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>1693</fpage>
          –
          <lpage>1701</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Felix</surname>
            <given-names>Hill</given-names>
          </string-name>
          , Antoine Bordes, Sumit Chopra, and
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          .
          <article-title>The goldilocks principle: Reading children's books with explicit memory representations</article-title>
          .
          <source>arXiv preprint arXiv:1511.02301</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <article-title>Jurgen Schmidhuber. Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mohit</surname>
            <given-names>Iyyer</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan L Boyd-Graber</surname>
          </string-name>
          , Leonardo Max Batista Claudino, Richard Socher, and
          <article-title>Hal Daume III. A neural network for factoid question answering over paragraphs</article-title>
          .
          <source>In EMNLP</source>
          , pages
          <fpage>633</fpage>
          –
          <lpage>644</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Adebayo</surname>
            <given-names>Kolawole John</given-names>
          </string-name>
          , Luigi Di Caro, Guido Boella, and
          <string-name>
            <given-names>Cesare</given-names>
            <surname>Bartolini</surname>
          </string-name>
          .
          <article-title>An approach to information retrieval and question answering in the legal domain</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Mi-Young</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Ying Xu,
          <string-name>
            <given-names>and Randy</given-names>
            <surname>Goebel</surname>
          </string-name>
          .
          <article-title>A convolutional neural network in legal question answering</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Mi-Young</surname>
            <given-names>Kim</given-names>
          </string-name>
          , Ying Xu,
          <string-name>
            <given-names>and Randy</given-names>
            <surname>Goebel</surname>
          </string-name>
          .
          <article-title>Legal question answering using ranking svm and syntactic/semantic similarity</article-title>
          .
          <source>In JSAI International Symposium on Artificial Intelligence</source>
          , pages
          <fpage>244</fpage>
          –
          <lpage>258</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Ankit</surname>
            <given-names>Kumar</given-names>
          </string-name>
          , Ozan Irsoy, Jonathan Su, James Bradbury, Robert English, Brian Pierce,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Ondruska</surname>
          </string-name>
          , Ishaan Gulrajani, and Richard Socher.
          <article-title>Ask me anything: Dynamic memory networks for natural language processing</article-title>
          . pages 0{
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>LR</given-names>
            <surname>Medsker and LC Jain</surname>
          </string-name>
          .
          <article-title>Recurrent neural networks</article-title>
          .
          <source>Design and Applications</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and Jeffrey Dean.
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Ankur P</given-names>
            <surname>Parikh</surname>
          </string-name>
          , Oscar Täckström,
          <string-name>
            <given-names>Dipanjan</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jakob</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          .
          <article-title>A decomposable attention model for natural language inference</article-title>
          .
          <source>arXiv preprint arXiv:1606.01933</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          In
          <source>EMNLP</source>
          , volume
          <volume>14</volume>
          , pages
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Deep learning in neural networks: An overview</article-title>
          .
          <source>Neural networks</source>
          ,
          <volume>61</volume>
          :
          <fpage>85</fpage>
          –
          <lpage>117</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>