=Paper=
{{Paper
|id=Vol-2143/paper3
|storemode=property
|title=Solving Bar Exam Questions with Deep Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2143/paper3.pdf
|volume=Vol-2143
|authors=Adebayo Kolawole John,Luigi Di Caro,Guido Boella
|dblpUrl=https://dblp.org/rec/conf/icail/AdebayoCB17
}}
==Solving Bar Exam Questions with Deep Neural Networks==
<pdf width="1500px">https://ceur-ws.org/Vol-2143/paper3.pdf</pdf>
<pre>
     Solving Bar Exam Questions with Deep Neural Networks

             Adebayo Kolawole John                                      Luigi Di Caro                        Guido Boella
               Department of Computer                           Department of Computer                Department of Computer
              Science, University of Torino                    Science, University of Torino         Science, University of Torino
                  Corso Svizzera 185                               Corso Svizzera 185                    Corso Svizzera 185
                  Torino, 10149, Italy                             Torino, 10149, Italy                  Torino, 10149, Italy
            collawolley3@yahoo.com                                  dicaro@di.unito.it                    guido@di.unito.it


ABSTRACT                                                                           crafts some features from the data, and the extracted fea-
In this paper, we present a system which solves a Bar Ex-                          tures are then shown to the algorithm for it to learn the
amination written in Natural Language. The proposed sys-                           latent discriminating features. Finally, the algorithm learns
tem exploits the recent techniques in Deep Neural Networks                         to predict the outcome of an unseen event. Neural Networks
which have shown promise in many Natural Language Pro-                             (NN) [8] are now extensively used by researchers because
cessing (NLP) applications. We evaluate our system on                              they offer a higher representational power. NN try to mimic
a real Legal Bar Examination, the United States Multi-                             the cognitive system of the human. They have a lot of inter-
State Bar Examination (MBE), which is a multi-choice 200-                          connected nodes. Each node receives some inputs from the
questions exam for aspiring lawyers. We show that our sys-                         lower layer nodes, performs a computation on the input by
tem achieves good performance without relying on any ex-                           using some non-linear functions, and lastly, the node trans-
ternal knowledge. Our work comes with an added effort of                           mits its output to the nodes in the layer above it. Such a
curating a small corpus, following similar question answer-                        network with many interconnecting layers stacked is called
ing datasets from the well-known MBE examination. The                              a Deep Neural Network (DNN) [24].
proposed system beats a TFIDF-based baseline, while show-                             When performed by a human, QA requires some form
ing a strong performance when modified for a legal Textual                         of cognitive abilities such as reasoning, meta-cognition, the
Entailment evaluation.                                                             contextual perception of abstract concepts, intelligence, and
                                                                                   language comprehension. Although machines are yet to repli-
                                                                                   cate a strong cognitive ability like a human, nevertheless, the
1.     INTRODUCTION                                                                non-cognitive computational techniques that employ heuris-
   Many tasks in Natural Language Processing (NLP) in-                             tics and statistical approximation can rightly model most
volve generation of semantic representation for proper text                        problems while giving an ’intelligent’ result which is close
understanding. For example, tasks like Textual Entailment                          to that from a human [27]. We leverage this assumption
[5] and Question Answering [11, 31] involve deep semantic                          by taking for granted the cognitive capability comparison to
understanding of the text since a popular approach like the                        our system. Instead, the goal is to achieve a result that is
Bag of Words (BOW) has limitations due to natural lan-                             presumed acceptable by a human examiner.
guage ambiguity.                                                                      In the QA task, systems are provided with a text passage
   Question Answering (QA) tasks follow the Human learn-                           containing some facts or background knowledge, and a ques-
ing and testing process. For instance, a student reads a                           tion which is related to that text passage. Furthermore, an
course note in order to obtain some facts and background                           answer to the question is provided. The system is then given
knowledge. The student then answers any question based                             a similar but slightly different question and is expected to
on the facts available to him. This is the main essence of                         answer it from the same background knowledge.
learning, which is about ’committing to memory’ and ’gen-                             The remaining part of the paper is organized as follows.
eralizing’ to new events. Even though learning seems to                            In the next section, we review the related work. This is
be a natural phenomenon to humans, it is nevertheless still                        followed by a description of the MBE Exam and the corpus
a challenging goal for computers to replicate. Researchers                         used for the experiment. Next, we describe our approach.
working in the Computer Science field of Machine Learning                          Finally, we describe the experiment and evaluation.
(ML) often employ methods to analyze existing data in or-
der to predict the likelihood of uncertain outcomes. These                         2.   RELATED WORK
methods usually produce results that approximate human                               NNs have shown good performance in many NLP tasks
capabilities [19].                                                                 including QA. The authors in [31, 12] achieved an excellent
   The term ML is actually a broad term used to describe                           result with DNN for QA. In particular, [31] achieved 100%
supervised or unsupervised approaches for making the com-                          accuracy on some tasks.1 Similarly, the work of [26] and the
puter identify patterns in our data. Usually, a human hand-                        Answer-Sentence Selection proposed by Feng [10] are also
                                                                                   based on NN. A considerable portion of the QA systems use
In: Proceedings of the Second Workshop on Automated Semantic Analysis of Infor-
                                                                                   1
mation in Legal Text (ASAIL 2017), June 16, 2017, London, UK.                        e.g. the single supporting facts and two supporting facts
Copyright © 2017 held by the authors. Copying permitted for private and academic   on BaBi dataset. A similar result was reported for CBT
purposes.                                                                          and Simple Question datasets. The datasets are accessible
Published at http://ceur-ws.org                                                    at https://research.facebook.com/research/babi/
a synthetic dataset. For example, the dataset in [31] was            in this respect. Most of the reviewed systems require a fac-
generated by simulating time-stepped facts using entity, lo-         toid answer. Furthermore, the datasets are mostly synthetic
cation and temporal information, e.g.,                               datasets, i.e., not a real examination question and answer.
                                                                     It is a popular saying that the ’Language of Law ’ does not
Ex 1:                                                                follow the ’Law of Language’. This is because being domain
1.    James is watching TV in his bedroom                            specific, legal texts employ legislative terms. For instance,
2.    James is Sleeping                                              a sentence may reference another sentence (e.g., an article)
3.    Where is James? -bedroom                                       without any explicit link. Also, sentences are generally long
                                                                     and often come with several clausal dependencies. More-
The models in [31, 12, 26] were trained to memorize factual          over, there is usually a couple of inter- and intra-sentential
information about the entities in a given story, e.g., keeping       anaphora resolution that must be resolved. Wyner [33] lists
track of the where, when, and who information regarding an           several NLP issues regarding the legal domain.
entity. Furthermore, the questions are quite simple. Each               The authors in [18, 17] employed a collection of legal text.
question requires only a factoid answer. According to the            The dataset3 was indeed prepared from the Japanese Bar
authors, it is expected that a question should be unambigu-          Examination. The task was proposed as a Textual Entail-
ous [31, 13]. Bordes et. al., [3] utilized a more challenging        ment (TE) task. The dataset consists of Japanese Civil
dataset. Nevertheless, the questions still require factoid an-       Code articles, some of which were used as the premise t,
swers. In particular, the dataset contains list questions, i.e.,     and others the hypothesis h. The authors utilized a couple
a question with multi-choice answers. The work in [12, 13,           of handcrafted features which are similar to the BOW fea-
31] showcases an array of experiments which is aimed at ex-          tures usually employed for text similarity and IR. Similar
amining and estimating the text comprehension capability             work was done in [29], where the authors mined reference
of a QA system.                                                      information from a collection of legal text.
   Some QA systems exploit external information, i.e., those            The most related work to ours is the work of Biralatei et.
available in a knowledge base, a Semantic Net, or the In-            al., [9] which makes use of a real legal examination question
ternet, for generating a plausible answer to a question. For         set. Specifically, the authors make use of the USA Multi-
instance, some researchers utilized a collection of facts which      State Bar Examination (MBE). In their experiment, they
have been extracted from a large text collection in form of          use 100 real multi-choice answer-question sets. Since each
Subject-Relation-Object (SVO) triples. The triples are then          question has 4 available answers out of which only one is cor-
stored in a knowledge base [7, 6]. The QA system is there-           rect, they proposed a TE solution. By performing a transfor-
fore trained to map a question to the relevant fact in the           mation on a question and each corresponding answer, they
knowledge base. This often requires transcribing a question          obtained 400 t and h pairs, where t is the background knowl-
into a format that can easily be matched to the fact in the          edge giving as the text passage to a question, and h is a
knowledge base. The problem with this approach is the over-          transformed question-answer output. More explicitly, the
reliance on a structured set of facts, e.g., (Donald Trump,          transformed question-answer output is a combination of a
is-president-of, United States). Moreover, the SVO triples           question and a possible answer. Consequently, the authors
may be difficult to curate, the triple extraction algorithms         aimed to see if the transformed text is entailed by a passage.
may overgenerate, and the accuracy for SVO extraction may            Analogous to the work described in [18], the proposed TE
not be optimal. Also, there is presently no domain-specific          system heavily profits from some handcrafted features which
collection of SVO fact triple for the Legal domain.                  typify a similarity between t and h.
   A few QA systems address solving a real exam question.               However, handcrafting a feature is an expensive and time
The closest to our work in this regard is QANTA [15] which           consuming process. It is easy to have noisy features and a se-
learns word and phrase-level representations with a Recur-           ries of ablation test is required to identify the best features.
rent Neural Network (RNN) for identifying an answer that             Also, their approach relies on word-similarity and synonym
appears as an entity in the paragraph. The authors in [2]            substitution using existing knowledge resources like Word-
presented a system for solving biology questions. Similarly          Net and VerbOcean. The authors then compute a BOW-
to QANTA, the paragraphs contain a description of a bio-             based similarity feature between t and h. The problem with
logical process, a short question, and two choice answers out        this approach is that the BOW-based approaches usually
of which only one is the correct answer. Weston et. al., [32,        suffer from language ambiguity.4 Furthermore, the approach
31] employed a Memory Network for the BAbI tasks.2 The               assumes that a text passage will have a lot of word overlap
BAbi task includes the single-supporting fact and multiple-          with the transformed h in case there is an entailment. This
supporting facts. However,some of the supporting facts are           assumption is costly and may not hold at all times. More-
irrelevant to the answer. Also included are the yes/no ques-         over, some questions require extra knowledge apart from
tions, and list/set questions. The Memory Network follows            what can be explicitly deduced from the given passage.
the Long Short-Term Memory (LSTM) which is a NN that                 The following example expatiates this point,
is capable of retaining information over a longer time-step
than a typical RNN. The McTest challenge proposed by Yin
et. al., [35] is also very related to our work. The essential dif-   Example 2:
ferences are the nature of the data used, the long sequences         Passage: A truck driver from State A and a bus driver from
of both paragraphs, question, and answers in our dataset,            State B were involved in a collision in State B that injured
as well as the format that the MBE exam question takes.              the truck driver. The truck driver filed a federal diversity
   However, there is limited prior work in the legal domain          3
                                                                       Released as part of the COLIEE Legal IR challenge. http:
2                                                                    //webdocs.cs.ualberta.ca/˜miyoung2/COLIEE2016/
 BaBi dataset is available at https://research.fb.com/
                                                                     4
projects/babi/                                                         e.g. synonymy, polysemy etc.
                                                                                       Feb (2016)   July (2016)    Total (2016)
action in State B based on negligence, seeking $100,000 in
                                                                        Min Score         72.5           58.6          58.6
damages from the bus driver.                                           Max Score         188.2         187.4          188.2
Question: What law of negligence should the court apply?               Mean Score         135.0         140.3         143.5
                                                                      Median Score        135.2         140.8         138.6
   • Answer A (false): The court should apply the fed-                Standard Dev        15.0           16.7          16.4
     eral common law of negligence.                                  No of Examinees     23,324        46,518         69,842

   • Answer B (false): The court should apply the neg-           Table 1: 2016 MBE National Summary Statistics
     ligence law of State A, the truck driver’s state of citi-   (Based on scaled scores). Note: The values reflect
     zenship.                                                    valid scores available electronically as of 1/18/2017

   • Answer C (false): The court should consider the
     negligence law of both State A and State B and ap-          3.     THE LQA CORPUS
     ply the law that the court believes most appropriately         For a human to answer a question, he has to have some
     governs negligence in this action.                          facts about the question. We can then generally make deduc-
                                                                 tions using the facts as well as some background knowledge
   • Answer D (true): The court should determine which
                                                                 in order to provide a plausible answer. The question answer-
     state’s negligence law a state court in State B would
                                                                 ing task mimics this simple approach whereby a background
     apply and apply that law in this action.
                                                                 knowledge from which to infer facts is provided. A question
In example 2 above, the passage represents the context or        is then given and an examinee has to make a judgment us-
some knowledge needed for answering the question. Given          ing these facts. Some questions can be direct, such that the
this example, an entailment-based system which focuses on        expected answer is straightforward. E.g., someone who has
similarity would fail since answering the question requires      access to a book on current affairs can easily answer a ques-
not just the word overlap but an understanding of the se-        tion like ’who is the president of the USA?’ -Donald Trump.
mantics of the underlying texts.                                 However, some questions require more than a set of facts for
   This work seeks to address this issue by proposing a NN       someone to be able to answer them correctly. This type of
Legal Question Answering (LQA) system which employs a            question requires logic in order to make a deduction from the
LSTM to encode and decode the question-answer pair for a         available facts. A typical example is the Bar examination.
good semantic representation. A LSTM is a type of RNN               The MBE is a six-hour, 200-questions multiple-choice ex-
with slightly more powerful language modeling capacity and       amination developed by the National Conference of Bar Ex-
it has become one of the most successful methods for end-        aminers (NCBE), and administered by the user jurisdiction
to-end supervised learning. Furthermore, LSTMs exhibit           as part of the Bar Examination. The goal of the exam is to
a memory bank property since they are able to retain in-         assess the extent to which an examinee can apply fundamen-
formation over many time-steps while also overcoming the         tal legal principles and legal reasoning in order to analyze
vanishing gradient problem [14, 32, 3].                          a given fact pattern.6 The exam is very important for it
   Our goal is to evaluate how well the proposed approach        is one of a number of measures that the NCBE may use in
can perform on a legal text reasoning task, and if the per-      determining an aspiring lawyer’s competence to practice.
formance of our model can compete with that of a human.          Each data point in the exam is a tuple, S = (P, Q, A41 ). Here,
Generally, MBE examinees are required to correctly pass at       P is the passage or background knowledge, Q is the question,
least 125 out of the 200 standard MBE questions. Although        and A is the answer. Since it is a multi-choice exam, there
the 125 score benchmark is not absolute, an examinee is          are four possible options in A, out of which only one is cor-
also required to get a certain number of points from the         rect and must be selected as the answer. The exam covers
essay exam. We assume that our model competes if it ob-          a wide area of law including Constitutional Law, Contracts
tains a score that is above the MBE nationwide Mean score,       Law, Criminal Law, Evidence, Real Property, Torts, and
which is computed based on statistical analysis of past MBE      Civil Procedure.
examinations. Table 1 shows the summary statistic of the            Similar to the approach in [9], for each A, we also split
national performance for the year 2016.5 The Maximum             S such that we have a separate representation for (P,Q,Ai ).
score obtained is 188/200, which is around 94%. The Mini-        However, since our goal is not a Textual Entailment task, we
mum is 58/200, which is about 29%, and the Mean score is         ignore any transformation on the text to obtain a t-h pair
143/200, which is approximately 71.5%. We also introduce         as it is the case in [9]. In our case, each question-answer
a new Legal QA corpus, specified in two formats which we         sample S is represented as 4 mini samples, i.e., s1 , s2 , s3 ,
describe in the subsequent section, and thereby propose a        s4 such that each s is also a 4-tuple (P, Q, Ai , F). Where
new form of Legal Question Answering task.                       P,Q,A remains the same and F symbolizes a binary flag for
   Many people from outside the ML field often regard NNs        identifying whether the answer is correct or not. In other
as black-box whose performance cannot be analyzed. To            words, the goal is to determine if a specific answer is suitable
assuage this sentiment, we benchmark our system against          to a question, given a background knowledge. The task is
a TFIDF baseline which predicts its outcome based on a           then formalized as an Answer-Sentence-Selection task.
TFIDF similarity between the passage, question, and answer
in a way similar to the TE setting of [9]. By obtaining a        Example 3:
significantly better result than the baseline, we validate the   Passage: An entrepreneur from state A decided to sell hot
performance of our system.                                       sauce to the public, labeling it ’Best Hot Sauce’. A com-
                                                                 pany incorporated in state B and headquartered in state
5
  http://www.ncbex.org/publications/statistics/
                                                                 6
mbe-statistics/                                                      http://www.ncbex.org/exams/mbe/
C sued the entrepreneur in federal court in state C. The        Where P,Q,A,F remains the same and E symbolizes the ex-
campaign sought $50,000 in damages and alleged that the         tra knowledge which justifies F. We say that E is the evi-
entrepreneur’s use of the name ’Best Hot Sauce’ infringed       dence since it justifies or explains why an answer is said to
the company’s federal trademark. The entrepreneur filed an      be correct or incorrect. Example 3 shows the passage, ques-
answer denying the allegations, and the parties began dis-      tion, answer along with the evidence which explains why the
covery. Six months later, the entrepreneur moved to dismiss     answer is correct or wrong. The goal is to make the system
for lack of subject-matter jurisdiction.                        take advantage of the extra knowledge since many questions
                                                                cannot be directly inferred from the passage without an ex-
Question:Should the court grant the entrepreneur’s mo-          tra information. It can be seen in example 3 that there is
tion?                                                           an absence of clear linguistic overlap between the passage
                                                                text and the answer text. Also, the passage text contains
  1. Answer A (True): No, because the complaint’s claim         less or no information required for answering a question. In
     arises under federal law.                                  this scenario, an extra information (evidence) may indeed
                                                                be helpful for answering the question.
        • Evidence: The claim asserts federal trademark
                                                                   For the purpose of LQA corpus, we use a random sam-
          infringement, and therefore it arises under federal
                                                                ple of 550 out of the 600 available passage-question-answer
          law. Subject-matter jurisdiction is proper under
                                                                set from the 1991 MBE-I, 1999-MBE-II, 1998-MBE-III and
          28 U.S.C. $1331 as a general federal-question ac-
                                                                some exam practice samples obtained from the examiner.7
          tion. That statute requires no minimum amount
                                                                We choose to use the exam questions because they are pub-
          in controversy, so the amount the company seeks
                                                                licly available and have a gold standard answer. We pre-
          is irrelevant.
                                                                pared the question set in the (P, Q, Ai , F) format explained
        • Label: 1                                              earlier, yielding 2200 passage-question-answer-flag.8 For the
                                                                second format with extra knowledge E, we obtained 15 an-
  2. Answer B (False): No, because the entrepreneur             notated passage-question texts. In total, we obtained a set
     waived the right to challenge subject-matter jurisdic-     of 60 question-answer sets in (P, Q, Ai , E, F) format. Be-
     tion by not raising the issue initially by motion or in    cause the number seems quite small, we are working towards
     the answer.                                                getting annotations for more samples. We rely on the va-
                                                                lidity/correctness of the gold standard and annotations ob-
        • Evidence: Under Federal Rule 12(h)(3), subject-
                                                                tained from our sources.
          matter jurisdiction cannot be waived and the court
          can determine at any time that it lacks subject-
          matter jurisdiction. Therefore, the fact that the     4.     NEURAL REASONING OVER LQA
          entrepreneur delayed six months before raising           Recently, NN algorithms such as the RNN [20] and LSTM
          the lack of subject-matter jurisdiction is imma-      [14] have excelled at language modeling tasks. LSTM, a
          terial and the court will not deny his motion on      variant of RNN, is especially powerful since it is robust to
          that basis.                                           the vanishing gradient problem and has a memory that is
        • Label: 0                                              controlled by the input gate, the forget gate, and the output
                                                                gate. The LSTM is therefore, able to retain information over
  3. Answer C (False): Yes, because although the claim          several time steps, i.e., a long sequence of words.
     arises under federal law, the amount in controversy is        LSTMs have been deeply studied [14, 28] and have vari-
     not satisfied.                                             ants like the Memory Networks [32, 31] which is specifically
                                                                wired to retain information over longer sequences. A LSTM
        • Evidence: There is no amount-in-controversy re-       network learns short and long-range contextual information.
          quirement for actions that arise under federal law.      At each time step t, let an LSTM unit be a collection of
        • Label: 0                                              vectors in Rd where d is the memory dimension: an input
                                                                gate it , a forget state ft , an output gate ot , a memory cell ct
  4. Answer D (False): Yes, because although there is           and a hidden state ht . The ut is a tanh layer that applies
     diversity the amount in controversy is not satisfied.      a non-linear function to the received input and creates a
                                                                vector of new candidate values that could be added to the
        • Evidence: Federal Rule 4(e)(2) governs service        state. The state of any gate can either be open or closed,
          on individual defendants and authorizes service       represented as [0,1]. The LSTM transition is represented by
          on a person of ’suitable age and discretion’ only     the following equations. (xt is the input vector at time step
          when service is made at the defendant’s dwelling      t, σ represents the sigmoid activation function, and is the
          or usual place of abode, not at the defendant’s       element-wise multiplication) :
          workplace.                                                                                               
        • Label: 0                                                             it = σ W (i) xt + U (i) ht−1 + b(i) ,
                                                                                                                   
Example 2 shows a sample passage and the corresponding
question and answers. We can see that the option labeled                      ft = σ W (f ) xt + U (f ) ht−1 + b(f ) ,
as ’True’ is the only correct answer.                                                                              
  The second format takes a similar style. However, we                         ot = σ W (o) xt + U (o) ht−1 + b(o) ,
introduce extra knowledge in the form of an explanation
                                                                7
made by an expert to validate why an answer is correct              http://www.ncbex.org/exams/mbe/
                                                                8
or not. Each sample is thus a 5-tuple (P, Q, Ai , E, F).            Our corpus is available on request
                                                                namely inter and intra attention. The intra attention focuses
          ut = tanh W (u) xt + U (u) ht−1 + b(u) ,                on the important words within the same text. Specifically,
                                                                  such important words can now be aggregated to compose the
                  ct = it    ut + f t   ct−1 ,                    meaning of the text. The implication of this is that we can
                                                                  use the intra attention to focus on important words indepen-
                                                                  dently for each P, Q, and A text. On the other hand, the
                      ht = o t   tanh ct                   (1)
                                                                  inter attention tries to attend to the important words in one
                                                                  text conditioned on the intra-attention weighted representa-
5.    METHODS                                                     tion of the second text. Analogously, the inter attention
   We describe the general framework of our model in this         allows for an interaction between two texts and ensures that
section. Given a set of inputs, the goal is to find an input      we focus on words that are most important for representing
representation that encodes both the passage P, the ques-         the meaning of one text, in the context of the other text.
tion Q, and the answer A. Our model is essentially a dis-            Following [1], we use intra-attention to obtain the sen-
tributional sentence model which is able to comprehend the        tence representation as shown in equation (3). Initially, the
semantics of the input texts. Our model has three key com-        encoded sentence (see equation (2)) is first passed through
ponents, i.e., the encoder module, the interaction module,        a Multi-Layer Perceptron (MLP) Neural Network to get a
and the output module.                                            hidden representation ui which is then weighted with the at-
                                                                  tention vector αi across the time steps. The attention vector
5.1    Input Encoder                                              αi is implemented as a Softmax whose weights sum up to
   At the input layer, we introduce three bi-directional LSTM     1, and are used to compute a weighted-average of the last
(BiLSTM) encoders that read the sequences of P, Q, and            hidden layers generated after processing each of the input
A separately. A BiLSTM is essentially composed of two             words.
LSTMs. One capturing information in one direction from
                                                                                      ui = tanh(Wp hp + bp )
the first time step to the last time-step while the other
captures information from the last time-step to the first.
                                                                                            exp(uM
                                                                                                 i up )
The outputs of the two LSTMs are then combined to ob-                                 αi = P       M
tain a final representation. Here, we represent each word                                   p exp(ui up )
in the sentences P, Q and A with a d-dimensional vector,                                         X
where the vectors are obtained from a word embedding ma-                                  hs =       αi hi                    (3)
trix. Generally, we use the Glove 300-dimensional vectors                                        p
obtained after training the Glove algorithm on 840 billion        Here, i signifies each time step in hp . hp is the encoded text,
words [23]. In practice, a domain-specific embedding can be       and M is the number of time-steps in hp . The vector up is
learned from a collection of legal texts by using an algorithm    a context vector which may be randomly initialized.
like Word2Vec [21]. However, our dataset is quite small for          The inter attention follows a similar approach. In par-
any useful embeddings to be generated with the Word2Vec           ticular, we use it to capture the interaction between the
algorithm. While building the vocabulary, any citation of         sentences using equation (4). Specifically, what this means
a Law article, e.g, 2.8 U.S.C .& 1331, date or money, e.g.,       is that we can use one inter attention layer to obtain the
$50,000 in a text is represented by a special symbol. Also,       interaction between the intra-attention hidden states of the
entities such as State A, State B or State C are automat-         encoded passage text and that of the encoded question text
ically identified and given a special symbol. Each special        (P → Q). Also, the same attention layer is employed to
symbol in the vocabulary is associated with a randomly ini-       capture the interaction between the encoded question text
tialized vector in the embedding matrix. We encode and            and encoded answer text (Q → A). Each of the interactions
obtain the sentence representation of each input text using       generated with the inter-attention produces a high-level rep-
equation 2 such that a vector representation that captures        resentation of these texts which can now be used for classi-
the meaning of each text is learned:                              fication. Put in another way, we obtain two vectors which
            −
            →              −−→                                    summarize the interaction between the input sentences.
            hi = LST M (hi−1 , Pi ), i ∈ [1, ..., M ]
                                                                                      us = tanh(Ws hs + bs )
           ←
           −           ←−−
           hi = LST M (hi−1 , Pi ),     i ∈ [M, ..., 1]
                                                                                            exp(uNs uq )
                                   −
                                   → ←   −                                            αs = P         Nu )
                   BiLST M (P ) = [ hi ; hi ]                                                q exp(u s q

                                                                                                X
                     hp = BiLST M (P )                     (2)                             s=        αs hs                    (4)
                                                                                                 q
5.2    Interaction Layer
   The interaction layer is formalized as a hierarchical at-      5.3    Output layer
tention layer for reducing the input space from three to two.        The task can be simplified as a binary classification task
Attention is a way of focusing on some important parts of an      since an answer either has a label 0 or 1. Because the two
input, and has been used extensively in some language mod-        vectors sp and sq are the ensuing representations which can
eling tasks such as machine translation, natural language in-     be regarded as the high-level representation of the interac-
ference and document classification [1, 22, 34]. Essentially,     tion between texts P, Q and A. In supervised learning, when
it is able to identify the parts of a text that are most impor-   there is a sufficient number of positive and negative sam-
tant to the overall meaning. We use two forms of attention,       ples for a category of example, we can formalize the task
as a ranking task, trying to create a margin between the                        System Description        (Accuracy %)
positive and negative examples, and ranking based on the                            TFIDF-based               44.80
margin. There are different approaches to the Learning to                            Our Model                71.90
Rank task, e.g., Pointwise, Pairwise, and Listwise [4]. A
Pointwise ranking is straighforward and involves training a               Table 2: Standard Evaluation on LQA dataset
binary classifier, i.e., given a triple of question q, answer a,
and a label y, as (qi , aij , yij ), the ranking function is given
as h(w, ψ (qi , aij )) ⇒ yij . Here, the ψ function creates a                   Human Performance            (Accuracy %)
feature vector from the question and answer sample. Also,                           Minimum                       0.29
w is a vector of model weights.                                                       Mean                       71.50
   In order to implement our binary classifier, we concatenate                      Maximum                      94.00
the vectors sp and sq (see (5)) and then propagate the output                      This paper                    71.90
of the concatenation to a MLP where the interaction is fully
modeled. Finally, a Softmax layer is used to distribute the          Table 3: Comparison with human performance
probability over the labels.                                         (2016 NBEX national statistics)

                        sconcat = [sp ; sq ]                  (5)
Formally, we denote li , i = 1, 2, 3, ..., N-1 as the intermediate
                                                                     operates such that, when given the contexts of a word, it is
hidden layers, Wi as the i-th weight matrix, and bi as the
                                                                     able to predict words that may appear close to that word. It
i-th bias term. The hidden layer computation of the MLP
                                                                     turns out that it captures many semantics characteristics of
can be represented as follows:
                                                                     a text, such as similarity and relatedness. It has been widely
                        l1 = W1 sconcat                              applied in numerous NLP tasks. We use the Keras9 Deep
                                                                     Learning library to prototype our model. The training data
          li = f (Wi li−1 + bi ),   i = 2, 3, ...., N − 1            is usually split into 80:10:10 for training, evaluation, and test
                                                                     respectively. We uniformly use a dropout of 0.20, a batch
                    yo = f (WN lN −1 + bN )                   (6)    size of 8, ADAM optimizer and a learning rate of 0.01. The
                                                                     model was trained for 20 epochs. Even though we already
where yo is the output vector of the last layer, f is a non-         apply dropout [25] throughout the model, we also use early
linear function which, in this work, is the hyperbolic tangent       stopping to avoid over-fitting, usually stopping the training
(tanh) activation function, and N represents the number of           after 4 consecutive epochs without any drop in the validation
layers in our neural network. The predicted class is obtained        loss. The model used for testing is the best obtained with
by passing the output vector yo through a softmax layer as           the validation set. We found out that our best model is
shown in equation (7).                                               achieved by epoch 10 after which, if we continue to train,
                  ŷ = Sof tmax(Wc yo + bc )                  (7)    we keep getting very high accuracy on the training data
                                                                     which does not generalize to the validation and test set.
where yo is the output vector from the outermost tanh layer,
Wc and bc are the weight matrix and bias vector which are            6.2      Experiment
the parameters to be learned by the network, and Softmax                We compare the results of our model against a TFIDF
is a non-linear activation function that distributes the class       baseline. The TFIDF baseline is based on a simple assump-
probabilities as shown in equation 8. ŷ is the predicted class.     tion that, if we consider the TFIDF scores of the passage
                                    eyθc                             text (i.e., P) on one hand, and the question and answer
                  P r(ŷ = c|y) = PK                          (8)    texts (i.e., Q + A) on the other hand, a high similarity be-
                                         yθk
                                   k=1 e                             tween the TFIDF scores indicate relevance of the answer to
where θk is the weight vector of the k-th class.                     the question. The TFIDF feature of (Q + A) is subtracted
                                                                     from the TFIDF feature of (P) and the resulting vector is
                                                                     passed through a MLP along with the label. This is a sim-
6.    SYSTEM EVALUATION                                              ple MLP classification approach. This is a naive assumption,
  We now describe the experiment and the result obtained.            however, we consider it an adequate baseline. Specifically,
Recall that the goal of our model is to identify whether an          we would like to know whether our model is capturing only
answer is correct given a question and a corresponding pas-          word overlap features or actual logic in form of the semantic
sage. This is different from the TE task which seeks to              of a text. Intuitively, we expect the TFIDF-based model
establish whether the hypothesis can be inferred from the            to capture overlap features. However, a good system must
premise.                                                             demonstrate that it captures not just the word overlap fea-
                                                                     tures but also, the semantics and other legal nuances in a
6.1    Training Parameter                                            text. Table 2 shows the result obtained from the experi-
  We implemented our model inspired by the work in [26,              ment. The table shows a comparison of the performance of
32]. As we have mentioned earlier, instead of encoding both          our model to a TFIDF-based predictor. Table 3 shows a
the passage, question and answer text sequences as one-hot           comparison of our model to the student performance in the
encoding representations of the token sequences, we used             MBE exam in the year 2016.
the pretrained 300-dimensional GloVe vectors [23]. We keep              In order to allow for comparison with a few legal TE sys-
the embedding weights fixed throughout the training. The             tems, we modify our model such that the input space is
embedding vectors are obtained from an algorithm which is
                                                                     9
based on the distributional hypothesis [30]. The algorithm               https://github.com/fchollet/keras
                Model              (Accuracy %)                    court, therefore, granted the motion in a one-line order and
            Kim et. al., [18]          55.87                       entered final judgment. The woman has appealed.
           Adebayo et. al., [16]       68.40
            Kim et. al., [18]          67.39                       Question:Is the appellate court likely to uphold the trial
              This paper               71.30                       court’s ruling?

Table 4: Evaluation as Textual Entailment task on                       • Answer A (false): No, because the complaint’s alle-
COLIEE 2014 dataset.                                                      gations were detailed and specific.

                                                                        • Answer B (true): No, because the employer moved
                                                                          for summary judgment on the basis that the woman
reduced to two, i.e., similar to a premise and a hypothesis.              was not credible, creating a factual dispute.
It is also possible to modify the text from our dataset. Nor-
mally, we could join the question text to its corresponding             • Answer C (false): Yes, because the woman’s failure
passage text, and regard it as the premise. We could also                 to respond to the summary-judgment motion means
manually rewrite the answer text where possible by includ-                that there was no sworn affidavit to support her alle-
ing some phrases from the question text, such that the text               gations and supporting documents.
reads sensibly. In that case, we can regard the resulting text
as the hypothesis. This would make the dataset preparation              • Answer D (false): Yes, because the woman’s failure
step similar to the one described in [9]. However, because                to respond to the summary-judgment motion was a
we do not have the dataset of Biralatei et. al., [9], it is dif-          default giving sufficient basis to grant the motion.
ficult to perform any direct comparison, even though their         We can see that predicting a correct answer for this par-
work is similar to ours in terms of the domain and data. In-       ticular example requires the semantic understanding of the
stead, we utilized the Japanese civil codes dataset which has      underlying text. We conclude that this is evidently lacking
been released in the context of COLIEE 2014. This dataset          in the TFIDF baseline.
has evolved over the years, and an increasing number of re-           Table 3 compares the result of our model with the over-
searchers are evaluating their work using this dataset.            all performance of students in 2016 NCBE statistics. We
We encode the input texts following the description given          arrive at the percentage score based on the data in Table
in section 5.1. However, we induce interaction between the         1. This is calculated by dividing each score by the total
input texts at only one level. What this means is that we          possible score (200) and then multiplying by 100 in order
perform only the intra-sentence attention without any need         to obtain a percentage score. We can see that our model
for the inter-sentence attention. Apart from this modifica-        significantly outperforms the minimum student score. Also,
tion, every other part of the model remains intact. Table 4        we obtain a better score than the mean student score. We
shows the result of our system against three other systems         can see that the model shows an appreciable approximation
when evaluated on the COLIEE dataset in the context of             of understanding of the legal technical jargon. We expect
Textual Entailment. The first and the third are the baseline       to have an improved performance once we have a sizable le-
systems, i.e., the result reported by the authors in [17]. The     gal text collection, which we can use to train the Word2Vec
second is a participant in the COLIEE task [16]. We can see        algorithm for obtaining the embedding matrix for our vo-
that our model slightly outperforms the reported papers.           cabulary words. In reality, it is even better if such texts
6.3    Discussion                                                  are related to the MBE exam. This will produce semanti-
                                                                   cally rich embeddings that will capture many legal terms. In
   Table 2 shows the result obtained on the LQA corpus             addition, using extra facts, e.g., as proposed in the second
when the main evaluation was done. We see that our model           format of the corpus, should improve the performance since
significantly outperforms a TFIDF baseline. Throughout             many extra details for general learning would be captured.
the evaluation, we use the standard accuracy metric. To val-
idate our model, we inspected the questions that were scored
correctly by our models but incorrectly by the TFIDF base-         7.    CONCLUSION
line. We give one example of such passage-question-pair. In           In this paper, we presented a Legal Question Answering
this particular example, the TFIDF baseline predicted the          system using a Deep Neural Network technique. Specifically,
wrong label for each of the answer options.                        we employed a LSTM Neural Network which has the ability
                                                                   to retain information much longer than a conventional Re-
Example 4:                                                         current Neural Network. We also described a corpus which
Passage: After being fired, a woman sued her former em-            has been extracted from the USA MBE exams. We formal-
ployer in federal court, alleging that her supervisor had dis-     ize the task as that of Answer-Sentence-Selection, where the
criminated against her on the basis of her sex. The woman’s        system selects the correct answer to a question given a back-
complaint included a lengthy description of what the super-        ground passage. When compared against a TFIDF baseline,
visor had said and done over the years, quoting his telephone      our model displayed a significantly better performance. Sim-
calls and emails to her and her own emails to the supervisor’s     ilarly, when compared against the human performance based
manager asking for help. The employer moved for summary            on the statistics available from student performance in MBE
judgment, alleging that the woman was a pathological liar          Exam. The system obtained a better performance than the
who had filed the action and included fictitious documents         mean student score. The proposed task is different from
in revenge for having been fired. Because the woman’s attor-       the Textual Entailment task. However, the system shows a
ney was at a lengthy out-of-state trial when the summary-          good result on a textual entailment dataset. In the future,
judgment motion was filed, he failed to respond to it. The         we would like to obtain more data from Legal tests like the
MBE or any equivalent exams in other countries. We pro-              In Legal Knowledge and Information Systems - JURIX
vided a dataset with more information that explains why an           2015: The Twenty-Eighth Annual Conference, Braga,
answer is correct or otherwise. Intuitively, ML algorithms           Portual, December 10-11, 2015, pages 179–180, 2015.
may learn from the extra information to guide their choice      [10] Minwei Feng, Bing Xiang, Michael R Glass, Lidan
of answer. However, this part is currently lacking in our            Wang, and Bowen Zhou. Applying deep learning to
work. In our future work, we would like to explore how we            answer selection: A study and an open task. In 2015
can improve the performance of our system by incorporat-             IEEE Workshop on Automatic Speech Recognition and
ing this evidential information as described in section 3. In        Understanding (ASRU), pages 813–820. IEEE, 2015.
particular, it would be interesting to compare ML models        [11] Jianfeng Gao, Li Deng, Michael Gamon, Xiaodong He,
that take advantage of this information to those who have            and Patrick Pantel. Modeling interestingness with
no access to such information.                                       deep neural networks, June 13 2014. US Patent App.
                                                                     14/304,863.
ACKNOWLEDGEMENT                                                 [12] Karl Moritz Hermann, Tomas Kocisky, Edward
                                                                     Grefenstette, Lasse Espeholt, Will Kay, Mustafa
Kolawole J. Adebayo has received funding from the Erasmus
                                                                     Suleyman, and Phil Blunsom. Teaching machines to
Mundus Joint International Doctoral (Ph.D.) programme in
                                                                     read and comprehend. In Advances in Neural
Law, Science and Technology. Luigi Di Caro and Guido
                                                                     Information Processing Systems, pages 1693–1701,
Boella have received funding from the European Union’s
                                                                     2015.
H2020 research and innovation programme under the grant
                                                                [13] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason
agreement No 690974 for the project ”MIREL: MIning and
                                                                     Weston. The goldilocks principle: Reading children’s
REasoning with Legal texts”. The authors would like to
                                                                     books with explicit memory representations. arXiv
thank the anonymous reviewers who have suggested ways to
                                                                     preprint arXiv:1511.02301, 2015.
improve the quality of the paper.
                                                                [14] Sepp Hochreiter and Jürgen Schmidhuber. Long
                                                                     short-term memory. Neural computation,
8.   REFERENCES                                                      9(8):1735–1780, 1997.
 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua                [15] Mohit Iyyer, Jordan L Boyd-Graber, Leonardo
     Bengio. Neural machine translation by jointly learning          Max Batista Claudino, Richard Socher, and Hal
     to align and translate. arXiv preprint                          Daumé III. A neural network for factoid question
     arXiv:1409.0473, 2014.                                          answering over paragraphs. In EMNLP, pages
 [2] Jonathan Berant, Vivek Srikumar, Pei-Chun Chen,                 633–644, 2014.
     Abby Vander Linden, Brittany Harding, Brad Huang,          [16] Adebayo Kolawole John, Luigi Di Caro, Guido Boella,
     Peter Clark, and Christopher D Manning. Modeling                and Cesare Bartolini. An approach to information
     biological processes for reading comprehension. In              retrieval and question answering in the legal domain.
     EMNLP, 2014.                                               [17] Mi-Young Kim, Ying Xu, and Randy Goebel. A
 [3] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and              convolutional neural network in legal question
     Jason Weston. Large-scale simple question answering             answering.
     with memory networks. arXiv preprint                       [18] Mi-Young Kim, Ying Xu, and Randy Goebel. Legal
     arXiv:1506.02075, 2015.                                         question answering using ranking svm and
 [4] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and              syntactic/semantic similarity. In JSAI International
     Hang Li. Learning to rank: from pairwise approach to            Symposium on Artificial Intelligence, pages 244–258.
     listwise approach. In Proceedings of the 24th                   Springer, 2014.
     international conference on Machine learning, pages        [19] Ankit Kumar, Ozan Irsoy, Jonathan Su, James
     129–136. ACM, 2007.                                             Bradbury, Robert English, Brian Pierce, Peter
 [5] Ido Dagan, Oren Glickman, and Bernardo Magnini.                 Ondruska, Ishaan Gulrajani, and Richard Socher. Ask
     The pascal recognising textual entailment challenge.            me anything: Dynamic memory networks for natural
     In Machine learning challenges. evaluating predictive           language processing. pages 0–6.
     uncertainty, visual object classification, and             [20] LR Medsker and LC Jain. Recurrent neural networks.
     recognising tectual entailment, pages 177–190.                  Design and Applications, 2001.
     Springer, 2006.                                            [21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey
 [6] Oren Etzioni, Anthony Fader, Janara Christensen,                Dean. Efficient estimation of word representations in
     Stephen Soderland, and Mausam Mausam. Open                      vector space. arXiv preprint arXiv:1301.3781, 2013.
     information extraction: The second generation. In          [22] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and
     IJCAI, volume 11, pages 3–10, 2011.                             Jakob Uszkoreit. A decomposable attention model for
 [7] Anthony Fader, Stephen Soderland, and Oren Etzioni.             natural language inference. arXiv preprint
     Identifying relations for open information extraction.          arXiv:1606.01933, 2016.
     In Proceedings of the Conference on Empirical Methods      [23] Jeffrey Pennington, Richard Socher, and
     in Natural Language Processing, pages 1535–1545.                Christopher D Manning. Glove: Global vectors for
     Association for Computational Linguistics, 2011.                word representation. In EMNLP, volume 14, pages
 [8] Laurene V Fausett. Fundamentals of neural networks.             1532–43, 2014.
     Prentice-Hall, 1994.                                       [24] Jürgen Schmidhuber. Deep learning in neural
 [9] Biralatei Fawei, Adam Z. Wyner, and Jeff Z. Pan.                networks: An overview. Neural networks, 61:85–117,
     Passing a USA national bar exam - a first experiment.           2015.
[25] Nitish Srivastava, Geoffrey E Hinton, Alex
     Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
     Dropout: a simple way to prevent neural networks
     from overfitting. Journal of Machine Learning
     Research, 15(1):1929–1958, 2014.
[26] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus,
     et al. End-to-end memory networks. In Advances in
     neural information processing systems, pages
     2440–2448, 2015.
[27] Harry Surden. Machine learning and law. 2014.
[28] Kai Sheng Tai, Richard Socher, and Christopher D
     Manning. Improved semantic representations from
     tree-structured long short-term memory networks.
     arXiv preprint arXiv:1503.00075, 2015.
[29] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen,
     and Akira Shimazu. Answering legal questions by
     mining reference information. In JSAI International
     Symposium on Artificial Intelligence, pages 214–229.
     Springer, 2013.
[30] Peter D Turney, Patrick Pantel, et al. From frequency
     to meaning: Vector space models of semantics. Journal
     of artificial intelligence research, 37(1):141–188, 2010.
[31] Jason Weston, Antoine Bordes, Sumit Chopra,
     Alexander M Rush, Bart van Merriënboer, Armand
     Joulin, and Tomas Mikolov. Towards ai-complete
     question answering: A set of prerequisite toy tasks.
     arXiv preprint arXiv:1502.05698, 2015.
[32] Jason Weston, Sumit Chopra, and Antoine Bordes.
     Memory networks. arXiv preprint arXiv:1410.3916,
     2014.
[33] Adam Wyner and Wim Peters. On rule extraction
     from regulations. In JURIX, volume 11, pages
     113–122. Citeseer, 2011.
[34] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He,
     Alex Smola, and Eduard Hovy. Hierarchical attention
     networks for document classification. In Proceedings of
     NAACL-HLT, pages 1480–1489, 2016.
[35] Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze.
     Attention-based convolutional neural network for
     machine comprehension. arXiv preprint
     arXiv:1602.04341, 2016.

</pre>