=Paper=
{{Paper
|id=Vol-2143/paper3
|storemode=property
|title=Solving Bar Exam Questions with Deep Neural Networks
|pdfUrl=https://ceur-ws.org/Vol-2143/paper3.pdf
|volume=Vol-2143
|authors=Adebayo Kolawole John,Luigi Di Caro,Guido Boella
|dblpUrl=https://dblp.org/rec/conf/icail/AdebayoCB17
}}
==Solving Bar Exam Questions with Deep Neural Networks==
Solving Bar Exam Questions with Deep Neural Networks Adebayo Kolawole John Luigi Di Caro Guido Boella Department of Computer Department of Computer Department of Computer Science, University of Torino Science, University of Torino Science, University of Torino Corso Svizzera 185 Corso Svizzera 185 Corso Svizzera 185 Torino, 10149, Italy Torino, 10149, Italy Torino, 10149, Italy collawolley3@yahoo.com dicaro@di.unito.it guido@di.unito.it ABSTRACT crafts some features from the data, and the extracted fea- In this paper, we present a system which solves a Bar Ex- tures are then shown to the algorithm for it to learn the amination written in Natural Language. The proposed sys- latent discriminating features. Finally, the algorithm learns tem exploits the recent techniques in Deep Neural Networks to predict the outcome of an unseen event. Neural Networks which have shown promise in many Natural Language Pro- (NN) [8] are now extensively used by researchers because cessing (NLP) applications. We evaluate our system on they offer a higher representational power. NN try to mimic a real Legal Bar Examination, the United States Multi- the cognitive system of the human. They have a lot of inter- State Bar Examination (MBE), which is a multi-choice 200- connected nodes. Each node receives some inputs from the questions exam for aspiring lawyers. We show that our sys- lower layer nodes, performs a computation on the input by tem achieves good performance without relying on any ex- using some non-linear functions, and lastly, the node trans- ternal knowledge. Our work comes with an added effort of mits its output to the nodes in the layer above it. Such a curating a small corpus, following similar question answer- network with many interconnecting layers stacked is called ing datasets from the well-known MBE examination. The a Deep Neural Network (DNN) [24]. proposed system beats a TFIDF-based baseline, while show- When performed by a human, QA requires some form ing a strong performance when modified for a legal Textual of cognitive abilities such as reasoning, meta-cognition, the Entailment evaluation. contextual perception of abstract concepts, intelligence, and language comprehension. Although machines are yet to repli- cate a strong cognitive ability like a human, nevertheless, the 1. INTRODUCTION non-cognitive computational techniques that employ heuris- Many tasks in Natural Language Processing (NLP) in- tics and statistical approximation can rightly model most volve generation of semantic representation for proper text problems while giving an ’intelligent’ result which is close understanding. For example, tasks like Textual Entailment to that from a human [27]. We leverage this assumption [5] and Question Answering [11, 31] involve deep semantic by taking for granted the cognitive capability comparison to understanding of the text since a popular approach like the our system. Instead, the goal is to achieve a result that is Bag of Words (BOW) has limitations due to natural lan- presumed acceptable by a human examiner. guage ambiguity. In the QA task, systems are provided with a text passage Question Answering (QA) tasks follow the Human learn- containing some facts or background knowledge, and a ques- ing and testing process. For instance, a student reads a tion which is related to that text passage. Furthermore, an course note in order to obtain some facts and background answer to the question is provided. The system is then given knowledge. The student then answers any question based a similar but slightly different question and is expected to on the facts available to him. This is the main essence of answer it from the same background knowledge. learning, which is about ’committing to memory’ and ’gen- The remaining part of the paper is organized as follows. eralizing’ to new events. Even though learning seems to In the next section, we review the related work. This is be a natural phenomenon to humans, it is nevertheless still followed by a description of the MBE Exam and the corpus a challenging goal for computers to replicate. Researchers used for the experiment. Next, we describe our approach. working in the Computer Science field of Machine Learning Finally, we describe the experiment and evaluation. (ML) often employ methods to analyze existing data in or- der to predict the likelihood of uncertain outcomes. These 2. RELATED WORK methods usually produce results that approximate human NNs have shown good performance in many NLP tasks capabilities [19]. including QA. The authors in [31, 12] achieved an excellent The term ML is actually a broad term used to describe result with DNN for QA. In particular, [31] achieved 100% supervised or unsupervised approaches for making the com- accuracy on some tasks.1 Similarly, the work of [26] and the puter identify patterns in our data. Usually, a human hand- Answer-Sentence Selection proposed by Feng [10] are also based on NN. A considerable portion of the QA systems use In: Proceedings of the Second Workshop on Automated Semantic Analysis of Infor- 1 mation in Legal Text (ASAIL 2017), June 16, 2017, London, UK. e.g. the single supporting facts and two supporting facts Copyright © 2017 held by the authors. Copying permitted for private and academic on BaBi dataset. A similar result was reported for CBT purposes. and Simple Question datasets. The datasets are accessible Published at http://ceur-ws.org at https://research.facebook.com/research/babi/ a synthetic dataset. For example, the dataset in [31] was in this respect. Most of the reviewed systems require a fac- generated by simulating time-stepped facts using entity, lo- toid answer. Furthermore, the datasets are mostly synthetic cation and temporal information, e.g., datasets, i.e., not a real examination question and answer. It is a popular saying that the ’Language of Law ’ does not Ex 1: follow the ’Law of Language’. This is because being domain 1. James is watching TV in his bedroom specific, legal texts employ legislative terms. For instance, 2. James is Sleeping a sentence may reference another sentence (e.g., an article) 3. Where is James? -bedroom without any explicit link. Also, sentences are generally long and often come with several clausal dependencies. More- The models in [31, 12, 26] were trained to memorize factual over, there is usually a couple of inter- and intra-sentential information about the entities in a given story, e.g., keeping anaphora resolution that must be resolved. Wyner [33] lists track of the where, when, and who information regarding an several NLP issues regarding the legal domain. entity. Furthermore, the questions are quite simple. Each The authors in [18, 17] employed a collection of legal text. question requires only a factoid answer. According to the The dataset3 was indeed prepared from the Japanese Bar authors, it is expected that a question should be unambigu- Examination. The task was proposed as a Textual Entail- ous [31, 13]. Bordes et. al., [3] utilized a more challenging ment (TE) task. The dataset consists of Japanese Civil dataset. Nevertheless, the questions still require factoid an- Code articles, some of which were used as the premise t, swers. In particular, the dataset contains list questions, i.e., and others the hypothesis h. The authors utilized a couple a question with multi-choice answers. The work in [12, 13, of handcrafted features which are similar to the BOW fea- 31] showcases an array of experiments which is aimed at ex- tures usually employed for text similarity and IR. Similar amining and estimating the text comprehension capability work was done in [29], where the authors mined reference of a QA system. information from a collection of legal text. Some QA systems exploit external information, i.e., those The most related work to ours is the work of Biralatei et. available in a knowledge base, a Semantic Net, or the In- al., [9] which makes use of a real legal examination question ternet, for generating a plausible answer to a question. For set. Specifically, the authors make use of the USA Multi- instance, some researchers utilized a collection of facts which State Bar Examination (MBE). In their experiment, they have been extracted from a large text collection in form of use 100 real multi-choice answer-question sets. Since each Subject-Relation-Object (SVO) triples. The triples are then question has 4 available answers out of which only one is cor- stored in a knowledge base [7, 6]. The QA system is there- rect, they proposed a TE solution. By performing a transfor- fore trained to map a question to the relevant fact in the mation on a question and each corresponding answer, they knowledge base. This often requires transcribing a question obtained 400 t and h pairs, where t is the background knowl- into a format that can easily be matched to the fact in the edge giving as the text passage to a question, and h is a knowledge base. The problem with this approach is the over- transformed question-answer output. More explicitly, the reliance on a structured set of facts, e.g., (Donald Trump, transformed question-answer output is a combination of a is-president-of, United States). Moreover, the SVO triples question and a possible answer. Consequently, the authors may be difficult to curate, the triple extraction algorithms aimed to see if the transformed text is entailed by a passage. may overgenerate, and the accuracy for SVO extraction may Analogous to the work described in [18], the proposed TE not be optimal. Also, there is presently no domain-specific system heavily profits from some handcrafted features which collection of SVO fact triple for the Legal domain. typify a similarity between t and h. A few QA systems address solving a real exam question. However, handcrafting a feature is an expensive and time The closest to our work in this regard is QANTA [15] which consuming process. It is easy to have noisy features and a se- learns word and phrase-level representations with a Recur- ries of ablation test is required to identify the best features. rent Neural Network (RNN) for identifying an answer that Also, their approach relies on word-similarity and synonym appears as an entity in the paragraph. The authors in [2] substitution using existing knowledge resources like Word- presented a system for solving biology questions. Similarly Net and VerbOcean. The authors then compute a BOW- to QANTA, the paragraphs contain a description of a bio- based similarity feature between t and h. The problem with logical process, a short question, and two choice answers out this approach is that the BOW-based approaches usually of which only one is the correct answer. Weston et. al., [32, suffer from language ambiguity.4 Furthermore, the approach 31] employed a Memory Network for the BAbI tasks.2 The assumes that a text passage will have a lot of word overlap BAbi task includes the single-supporting fact and multiple- with the transformed h in case there is an entailment. This supporting facts. However,some of the supporting facts are assumption is costly and may not hold at all times. More- irrelevant to the answer. Also included are the yes/no ques- over, some questions require extra knowledge apart from tions, and list/set questions. The Memory Network follows what can be explicitly deduced from the given passage. the Long Short-Term Memory (LSTM) which is a NN that The following example expatiates this point, is capable of retaining information over a longer time-step than a typical RNN. The McTest challenge proposed by Yin et. al., [35] is also very related to our work. The essential dif- Example 2: ferences are the nature of the data used, the long sequences Passage: A truck driver from State A and a bus driver from of both paragraphs, question, and answers in our dataset, State B were involved in a collision in State B that injured as well as the format that the MBE exam question takes. the truck driver. The truck driver filed a federal diversity However, there is limited prior work in the legal domain 3 Released as part of the COLIEE Legal IR challenge. http: 2 //webdocs.cs.ualberta.ca/˜miyoung2/COLIEE2016/ BaBi dataset is available at https://research.fb.com/ 4 projects/babi/ e.g. synonymy, polysemy etc. Feb (2016) July (2016) Total (2016) action in State B based on negligence, seeking $100,000 in Min Score 72.5 58.6 58.6 damages from the bus driver. Max Score 188.2 187.4 188.2 Question: What law of negligence should the court apply? Mean Score 135.0 140.3 143.5 Median Score 135.2 140.8 138.6 • Answer A (false): The court should apply the fed- Standard Dev 15.0 16.7 16.4 eral common law of negligence. No of Examinees 23,324 46,518 69,842 • Answer B (false): The court should apply the neg- Table 1: 2016 MBE National Summary Statistics ligence law of State A, the truck driver’s state of citi- (Based on scaled scores). Note: The values reflect zenship. valid scores available electronically as of 1/18/2017 • Answer C (false): The court should consider the negligence law of both State A and State B and ap- 3. THE LQA CORPUS ply the law that the court believes most appropriately For a human to answer a question, he has to have some governs negligence in this action. facts about the question. We can then generally make deduc- tions using the facts as well as some background knowledge • Answer D (true): The court should determine which in order to provide a plausible answer. The question answer- state’s negligence law a state court in State B would ing task mimics this simple approach whereby a background apply and apply that law in this action. knowledge from which to infer facts is provided. A question In example 2 above, the passage represents the context or is then given and an examinee has to make a judgment us- some knowledge needed for answering the question. Given ing these facts. Some questions can be direct, such that the this example, an entailment-based system which focuses on expected answer is straightforward. E.g., someone who has similarity would fail since answering the question requires access to a book on current affairs can easily answer a ques- not just the word overlap but an understanding of the se- tion like ’who is the president of the USA?’ -Donald Trump. mantics of the underlying texts. However, some questions require more than a set of facts for This work seeks to address this issue by proposing a NN someone to be able to answer them correctly. This type of Legal Question Answering (LQA) system which employs a question requires logic in order to make a deduction from the LSTM to encode and decode the question-answer pair for a available facts. A typical example is the Bar examination. good semantic representation. A LSTM is a type of RNN The MBE is a six-hour, 200-questions multiple-choice ex- with slightly more powerful language modeling capacity and amination developed by the National Conference of Bar Ex- it has become one of the most successful methods for end- aminers (NCBE), and administered by the user jurisdiction to-end supervised learning. Furthermore, LSTMs exhibit as part of the Bar Examination. The goal of the exam is to a memory bank property since they are able to retain in- assess the extent to which an examinee can apply fundamen- formation over many time-steps while also overcoming the tal legal principles and legal reasoning in order to analyze vanishing gradient problem [14, 32, 3]. a given fact pattern.6 The exam is very important for it Our goal is to evaluate how well the proposed approach is one of a number of measures that the NCBE may use in can perform on a legal text reasoning task, and if the per- determining an aspiring lawyer’s competence to practice. formance of our model can compete with that of a human. Each data point in the exam is a tuple, S = (P, Q, A41 ). Here, Generally, MBE examinees are required to correctly pass at P is the passage or background knowledge, Q is the question, least 125 out of the 200 standard MBE questions. Although and A is the answer. Since it is a multi-choice exam, there the 125 score benchmark is not absolute, an examinee is are four possible options in A, out of which only one is cor- also required to get a certain number of points from the rect and must be selected as the answer. The exam covers essay exam. We assume that our model competes if it ob- a wide area of law including Constitutional Law, Contracts tains a score that is above the MBE nationwide Mean score, Law, Criminal Law, Evidence, Real Property, Torts, and which is computed based on statistical analysis of past MBE Civil Procedure. examinations. Table 1 shows the summary statistic of the Similar to the approach in [9], for each A, we also split national performance for the year 2016.5 The Maximum S such that we have a separate representation for (P,Q,Ai ). score obtained is 188/200, which is around 94%. The Mini- However, since our goal is not a Textual Entailment task, we mum is 58/200, which is about 29%, and the Mean score is ignore any transformation on the text to obtain a t-h pair 143/200, which is approximately 71.5%. We also introduce as it is the case in [9]. In our case, each question-answer a new Legal QA corpus, specified in two formats which we sample S is represented as 4 mini samples, i.e., s1 , s2 , s3 , describe in the subsequent section, and thereby propose a s4 such that each s is also a 4-tuple (P, Q, Ai , F). Where new form of Legal Question Answering task. P,Q,A remains the same and F symbolizes a binary flag for Many people from outside the ML field often regard NNs identifying whether the answer is correct or not. In other as black-box whose performance cannot be analyzed. To words, the goal is to determine if a specific answer is suitable assuage this sentiment, we benchmark our system against to a question, given a background knowledge. The task is a TFIDF baseline which predicts its outcome based on a then formalized as an Answer-Sentence-Selection task. TFIDF similarity between the passage, question, and answer in a way similar to the TE setting of [9]. By obtaining a Example 3: significantly better result than the baseline, we validate the Passage: An entrepreneur from state A decided to sell hot performance of our system. sauce to the public, labeling it ’Best Hot Sauce’. A com- pany incorporated in state B and headquartered in state 5 http://www.ncbex.org/publications/statistics/ 6 mbe-statistics/ http://www.ncbex.org/exams/mbe/ C sued the entrepreneur in federal court in state C. The Where P,Q,A,F remains the same and E symbolizes the ex- campaign sought $50,000 in damages and alleged that the tra knowledge which justifies F. We say that E is the evi- entrepreneur’s use of the name ’Best Hot Sauce’ infringed dence since it justifies or explains why an answer is said to the company’s federal trademark. The entrepreneur filed an be correct or incorrect. Example 3 shows the passage, ques- answer denying the allegations, and the parties began dis- tion, answer along with the evidence which explains why the covery. Six months later, the entrepreneur moved to dismiss answer is correct or wrong. The goal is to make the system for lack of subject-matter jurisdiction. take advantage of the extra knowledge since many questions cannot be directly inferred from the passage without an ex- Question:Should the court grant the entrepreneur’s mo- tra information. It can be seen in example 3 that there is tion? an absence of clear linguistic overlap between the passage text and the answer text. Also, the passage text contains 1. Answer A (True): No, because the complaint’s claim less or no information required for answering a question. In arises under federal law. this scenario, an extra information (evidence) may indeed be helpful for answering the question. • Evidence: The claim asserts federal trademark For the purpose of LQA corpus, we use a random sam- infringement, and therefore it arises under federal ple of 550 out of the 600 available passage-question-answer law. Subject-matter jurisdiction is proper under set from the 1991 MBE-I, 1999-MBE-II, 1998-MBE-III and 28 U.S.C. $1331 as a general federal-question ac- some exam practice samples obtained from the examiner.7 tion. That statute requires no minimum amount We choose to use the exam questions because they are pub- in controversy, so the amount the company seeks licly available and have a gold standard answer. We pre- is irrelevant. pared the question set in the (P, Q, Ai , F) format explained • Label: 1 earlier, yielding 2200 passage-question-answer-flag.8 For the second format with extra knowledge E, we obtained 15 an- 2. Answer B (False): No, because the entrepreneur notated passage-question texts. In total, we obtained a set waived the right to challenge subject-matter jurisdic- of 60 question-answer sets in (P, Q, Ai , E, F) format. Be- tion by not raising the issue initially by motion or in cause the number seems quite small, we are working towards the answer. getting annotations for more samples. We rely on the va- lidity/correctness of the gold standard and annotations ob- • Evidence: Under Federal Rule 12(h)(3), subject- tained from our sources. matter jurisdiction cannot be waived and the court can determine at any time that it lacks subject- matter jurisdiction. Therefore, the fact that the 4. NEURAL REASONING OVER LQA entrepreneur delayed six months before raising Recently, NN algorithms such as the RNN [20] and LSTM the lack of subject-matter jurisdiction is imma- [14] have excelled at language modeling tasks. LSTM, a terial and the court will not deny his motion on variant of RNN, is especially powerful since it is robust to that basis. the vanishing gradient problem and has a memory that is • Label: 0 controlled by the input gate, the forget gate, and the output gate. The LSTM is therefore, able to retain information over 3. Answer C (False): Yes, because although the claim several time steps, i.e., a long sequence of words. arises under federal law, the amount in controversy is LSTMs have been deeply studied [14, 28] and have vari- not satisfied. ants like the Memory Networks [32, 31] which is specifically wired to retain information over longer sequences. A LSTM • Evidence: There is no amount-in-controversy re- network learns short and long-range contextual information. quirement for actions that arise under federal law. At each time step t, let an LSTM unit be a collection of • Label: 0 vectors in Rd where d is the memory dimension: an input gate it , a forget state ft , an output gate ot , a memory cell ct 4. Answer D (False): Yes, because although there is and a hidden state ht . The ut is a tanh layer that applies diversity the amount in controversy is not satisfied. a non-linear function to the received input and creates a vector of new candidate values that could be added to the • Evidence: Federal Rule 4(e)(2) governs service state. The state of any gate can either be open or closed, on individual defendants and authorizes service represented as [0,1]. The LSTM transition is represented by on a person of ’suitable age and discretion’ only the following equations. (xt is the input vector at time step when service is made at the defendant’s dwelling t, σ represents the sigmoid activation function, and is the or usual place of abode, not at the defendant’s element-wise multiplication) : workplace. • Label: 0 it = σ W (i) xt + U (i) ht−1 + b(i) , Example 2 shows a sample passage and the corresponding question and answers. We can see that the option labeled ft = σ W (f ) xt + U (f ) ht−1 + b(f ) , as ’True’ is the only correct answer. The second format takes a similar style. However, we ot = σ W (o) xt + U (o) ht−1 + b(o) , introduce extra knowledge in the form of an explanation 7 made by an expert to validate why an answer is correct http://www.ncbex.org/exams/mbe/ 8 or not. Each sample is thus a 5-tuple (P, Q, Ai , E, F). Our corpus is available on request namely inter and intra attention. The intra attention focuses ut = tanh W (u) xt + U (u) ht−1 + b(u) , on the important words within the same text. Specifically, such important words can now be aggregated to compose the ct = it ut + f t ct−1 , meaning of the text. The implication of this is that we can use the intra attention to focus on important words indepen- dently for each P, Q, and A text. On the other hand, the ht = o t tanh ct (1) inter attention tries to attend to the important words in one text conditioned on the intra-attention weighted representa- 5. METHODS tion of the second text. Analogously, the inter attention We describe the general framework of our model in this allows for an interaction between two texts and ensures that section. Given a set of inputs, the goal is to find an input we focus on words that are most important for representing representation that encodes both the passage P, the ques- the meaning of one text, in the context of the other text. tion Q, and the answer A. Our model is essentially a dis- Following [1], we use intra-attention to obtain the sen- tributional sentence model which is able to comprehend the tence representation as shown in equation (3). Initially, the semantics of the input texts. Our model has three key com- encoded sentence (see equation (2)) is first passed through ponents, i.e., the encoder module, the interaction module, a Multi-Layer Perceptron (MLP) Neural Network to get a and the output module. hidden representation ui which is then weighted with the at- tention vector αi across the time steps. The attention vector 5.1 Input Encoder αi is implemented as a Softmax whose weights sum up to At the input layer, we introduce three bi-directional LSTM 1, and are used to compute a weighted-average of the last (BiLSTM) encoders that read the sequences of P, Q, and hidden layers generated after processing each of the input A separately. A BiLSTM is essentially composed of two words. LSTMs. One capturing information in one direction from ui = tanh(Wp hp + bp ) the first time step to the last time-step while the other captures information from the last time-step to the first. exp(uM i up ) The outputs of the two LSTMs are then combined to ob- αi = P M tain a final representation. Here, we represent each word p exp(ui up ) in the sentences P, Q and A with a d-dimensional vector, X where the vectors are obtained from a word embedding ma- hs = αi hi (3) trix. Generally, we use the Glove 300-dimensional vectors p obtained after training the Glove algorithm on 840 billion Here, i signifies each time step in hp . hp is the encoded text, words [23]. In practice, a domain-specific embedding can be and M is the number of time-steps in hp . The vector up is learned from a collection of legal texts by using an algorithm a context vector which may be randomly initialized. like Word2Vec [21]. However, our dataset is quite small for The inter attention follows a similar approach. In par- any useful embeddings to be generated with the Word2Vec ticular, we use it to capture the interaction between the algorithm. While building the vocabulary, any citation of sentences using equation (4). Specifically, what this means a Law article, e.g, 2.8 U.S.C .& 1331, date or money, e.g., is that we can use one inter attention layer to obtain the $50,000 in a text is represented by a special symbol. Also, interaction between the intra-attention hidden states of the entities such as State A, State B or State C are automat- encoded passage text and that of the encoded question text ically identified and given a special symbol. Each special (P → Q). Also, the same attention layer is employed to symbol in the vocabulary is associated with a randomly ini- capture the interaction between the encoded question text tialized vector in the embedding matrix. We encode and and encoded answer text (Q → A). Each of the interactions obtain the sentence representation of each input text using generated with the inter-attention produces a high-level rep- equation 2 such that a vector representation that captures resentation of these texts which can now be used for classi- the meaning of each text is learned: fication. Put in another way, we obtain two vectors which − → −−→ summarize the interaction between the input sentences. hi = LST M (hi−1 , Pi ), i ∈ [1, ..., M ] us = tanh(Ws hs + bs ) ← − ←−− hi = LST M (hi−1 , Pi ), i ∈ [M, ..., 1] exp(uNs uq ) − → ← − αs = P Nu ) BiLST M (P ) = [ hi ; hi ] q exp(u s q X hp = BiLST M (P ) (2) s= αs hs (4) q 5.2 Interaction Layer The interaction layer is formalized as a hierarchical at- 5.3 Output layer tention layer for reducing the input space from three to two. The task can be simplified as a binary classification task Attention is a way of focusing on some important parts of an since an answer either has a label 0 or 1. Because the two input, and has been used extensively in some language mod- vectors sp and sq are the ensuing representations which can eling tasks such as machine translation, natural language in- be regarded as the high-level representation of the interac- ference and document classification [1, 22, 34]. Essentially, tion between texts P, Q and A. In supervised learning, when it is able to identify the parts of a text that are most impor- there is a sufficient number of positive and negative sam- tant to the overall meaning. We use two forms of attention, ples for a category of example, we can formalize the task as a ranking task, trying to create a margin between the System Description (Accuracy %) positive and negative examples, and ranking based on the TFIDF-based 44.80 margin. There are different approaches to the Learning to Our Model 71.90 Rank task, e.g., Pointwise, Pairwise, and Listwise [4]. A Pointwise ranking is straighforward and involves training a Table 2: Standard Evaluation on LQA dataset binary classifier, i.e., given a triple of question q, answer a, and a label y, as (qi , aij , yij ), the ranking function is given as h(w, ψ (qi , aij )) ⇒ yij . Here, the ψ function creates a Human Performance (Accuracy %) feature vector from the question and answer sample. Also, Minimum 0.29 w is a vector of model weights. Mean 71.50 In order to implement our binary classifier, we concatenate Maximum 94.00 the vectors sp and sq (see (5)) and then propagate the output This paper 71.90 of the concatenation to a MLP where the interaction is fully modeled. Finally, a Softmax layer is used to distribute the Table 3: Comparison with human performance probability over the labels. (2016 NBEX national statistics) sconcat = [sp ; sq ] (5) Formally, we denote li , i = 1, 2, 3, ..., N-1 as the intermediate operates such that, when given the contexts of a word, it is hidden layers, Wi as the i-th weight matrix, and bi as the able to predict words that may appear close to that word. It i-th bias term. The hidden layer computation of the MLP turns out that it captures many semantics characteristics of can be represented as follows: a text, such as similarity and relatedness. It has been widely l1 = W1 sconcat applied in numerous NLP tasks. We use the Keras9 Deep Learning library to prototype our model. The training data li = f (Wi li−1 + bi ), i = 2, 3, ...., N − 1 is usually split into 80:10:10 for training, evaluation, and test respectively. We uniformly use a dropout of 0.20, a batch yo = f (WN lN −1 + bN ) (6) size of 8, ADAM optimizer and a learning rate of 0.01. The model was trained for 20 epochs. Even though we already where yo is the output vector of the last layer, f is a non- apply dropout [25] throughout the model, we also use early linear function which, in this work, is the hyperbolic tangent stopping to avoid over-fitting, usually stopping the training (tanh) activation function, and N represents the number of after 4 consecutive epochs without any drop in the validation layers in our neural network. The predicted class is obtained loss. The model used for testing is the best obtained with by passing the output vector yo through a softmax layer as the validation set. We found out that our best model is shown in equation (7). achieved by epoch 10 after which, if we continue to train, ŷ = Sof tmax(Wc yo + bc ) (7) we keep getting very high accuracy on the training data which does not generalize to the validation and test set. where yo is the output vector from the outermost tanh layer, Wc and bc are the weight matrix and bias vector which are 6.2 Experiment the parameters to be learned by the network, and Softmax We compare the results of our model against a TFIDF is a non-linear activation function that distributes the class baseline. The TFIDF baseline is based on a simple assump- probabilities as shown in equation 8. ŷ is the predicted class. tion that, if we consider the TFIDF scores of the passage eyθc text (i.e., P) on one hand, and the question and answer P r(ŷ = c|y) = PK (8) texts (i.e., Q + A) on the other hand, a high similarity be- yθk k=1 e tween the TFIDF scores indicate relevance of the answer to where θk is the weight vector of the k-th class. the question. The TFIDF feature of (Q + A) is subtracted from the TFIDF feature of (P) and the resulting vector is passed through a MLP along with the label. This is a sim- 6. SYSTEM EVALUATION ple MLP classification approach. This is a naive assumption, We now describe the experiment and the result obtained. however, we consider it an adequate baseline. Specifically, Recall that the goal of our model is to identify whether an we would like to know whether our model is capturing only answer is correct given a question and a corresponding pas- word overlap features or actual logic in form of the semantic sage. This is different from the TE task which seeks to of a text. Intuitively, we expect the TFIDF-based model establish whether the hypothesis can be inferred from the to capture overlap features. However, a good system must premise. demonstrate that it captures not just the word overlap fea- tures but also, the semantics and other legal nuances in a 6.1 Training Parameter text. Table 2 shows the result obtained from the experi- We implemented our model inspired by the work in [26, ment. The table shows a comparison of the performance of 32]. As we have mentioned earlier, instead of encoding both our model to a TFIDF-based predictor. Table 3 shows a the passage, question and answer text sequences as one-hot comparison of our model to the student performance in the encoding representations of the token sequences, we used MBE exam in the year 2016. the pretrained 300-dimensional GloVe vectors [23]. We keep In order to allow for comparison with a few legal TE sys- the embedding weights fixed throughout the training. The tems, we modify our model such that the input space is embedding vectors are obtained from an algorithm which is 9 based on the distributional hypothesis [30]. The algorithm https://github.com/fchollet/keras Model (Accuracy %) court, therefore, granted the motion in a one-line order and Kim et. al., [18] 55.87 entered final judgment. The woman has appealed. Adebayo et. al., [16] 68.40 Kim et. al., [18] 67.39 Question:Is the appellate court likely to uphold the trial This paper 71.30 court’s ruling? Table 4: Evaluation as Textual Entailment task on • Answer A (false): No, because the complaint’s alle- COLIEE 2014 dataset. gations were detailed and specific. • Answer B (true): No, because the employer moved for summary judgment on the basis that the woman reduced to two, i.e., similar to a premise and a hypothesis. was not credible, creating a factual dispute. It is also possible to modify the text from our dataset. Nor- mally, we could join the question text to its corresponding • Answer C (false): Yes, because the woman’s failure passage text, and regard it as the premise. We could also to respond to the summary-judgment motion means manually rewrite the answer text where possible by includ- that there was no sworn affidavit to support her alle- ing some phrases from the question text, such that the text gations and supporting documents. reads sensibly. In that case, we can regard the resulting text as the hypothesis. This would make the dataset preparation • Answer D (false): Yes, because the woman’s failure step similar to the one described in [9]. However, because to respond to the summary-judgment motion was a we do not have the dataset of Biralatei et. al., [9], it is dif- default giving sufficient basis to grant the motion. ficult to perform any direct comparison, even though their We can see that predicting a correct answer for this par- work is similar to ours in terms of the domain and data. In- ticular example requires the semantic understanding of the stead, we utilized the Japanese civil codes dataset which has underlying text. We conclude that this is evidently lacking been released in the context of COLIEE 2014. This dataset in the TFIDF baseline. has evolved over the years, and an increasing number of re- Table 3 compares the result of our model with the over- searchers are evaluating their work using this dataset. all performance of students in 2016 NCBE statistics. We We encode the input texts following the description given arrive at the percentage score based on the data in Table in section 5.1. However, we induce interaction between the 1. This is calculated by dividing each score by the total input texts at only one level. What this means is that we possible score (200) and then multiplying by 100 in order perform only the intra-sentence attention without any need to obtain a percentage score. We can see that our model for the inter-sentence attention. Apart from this modifica- significantly outperforms the minimum student score. Also, tion, every other part of the model remains intact. Table 4 we obtain a better score than the mean student score. We shows the result of our system against three other systems can see that the model shows an appreciable approximation when evaluated on the COLIEE dataset in the context of of understanding of the legal technical jargon. We expect Textual Entailment. The first and the third are the baseline to have an improved performance once we have a sizable le- systems, i.e., the result reported by the authors in [17]. The gal text collection, which we can use to train the Word2Vec second is a participant in the COLIEE task [16]. We can see algorithm for obtaining the embedding matrix for our vo- that our model slightly outperforms the reported papers. cabulary words. In reality, it is even better if such texts 6.3 Discussion are related to the MBE exam. This will produce semanti- cally rich embeddings that will capture many legal terms. In Table 2 shows the result obtained on the LQA corpus addition, using extra facts, e.g., as proposed in the second when the main evaluation was done. We see that our model format of the corpus, should improve the performance since significantly outperforms a TFIDF baseline. Throughout many extra details for general learning would be captured. the evaluation, we use the standard accuracy metric. To val- idate our model, we inspected the questions that were scored correctly by our models but incorrectly by the TFIDF base- 7. CONCLUSION line. We give one example of such passage-question-pair. In In this paper, we presented a Legal Question Answering this particular example, the TFIDF baseline predicted the system using a Deep Neural Network technique. Specifically, wrong label for each of the answer options. we employed a LSTM Neural Network which has the ability to retain information much longer than a conventional Re- Example 4: current Neural Network. We also described a corpus which Passage: After being fired, a woman sued her former em- has been extracted from the USA MBE exams. We formal- ployer in federal court, alleging that her supervisor had dis- ize the task as that of Answer-Sentence-Selection, where the criminated against her on the basis of her sex. The woman’s system selects the correct answer to a question given a back- complaint included a lengthy description of what the super- ground passage. When compared against a TFIDF baseline, visor had said and done over the years, quoting his telephone our model displayed a significantly better performance. Sim- calls and emails to her and her own emails to the supervisor’s ilarly, when compared against the human performance based manager asking for help. The employer moved for summary on the statistics available from student performance in MBE judgment, alleging that the woman was a pathological liar Exam. The system obtained a better performance than the who had filed the action and included fictitious documents mean student score. The proposed task is different from in revenge for having been fired. Because the woman’s attor- the Textual Entailment task. However, the system shows a ney was at a lengthy out-of-state trial when the summary- good result on a textual entailment dataset. In the future, judgment motion was filed, he failed to respond to it. The we would like to obtain more data from Legal tests like the MBE or any equivalent exams in other countries. We pro- In Legal Knowledge and Information Systems - JURIX vided a dataset with more information that explains why an 2015: The Twenty-Eighth Annual Conference, Braga, answer is correct or otherwise. Intuitively, ML algorithms Portual, December 10-11, 2015, pages 179–180, 2015. may learn from the extra information to guide their choice [10] Minwei Feng, Bing Xiang, Michael R Glass, Lidan of answer. However, this part is currently lacking in our Wang, and Bowen Zhou. Applying deep learning to work. In our future work, we would like to explore how we answer selection: A study and an open task. In 2015 can improve the performance of our system by incorporat- IEEE Workshop on Automatic Speech Recognition and ing this evidential information as described in section 3. In Understanding (ASRU), pages 813–820. IEEE, 2015. particular, it would be interesting to compare ML models [11] Jianfeng Gao, Li Deng, Michael Gamon, Xiaodong He, that take advantage of this information to those who have and Patrick Pantel. Modeling interestingness with no access to such information. deep neural networks, June 13 2014. US Patent App. 14/304,863. ACKNOWLEDGEMENT [12] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Kolawole J. Adebayo has received funding from the Erasmus Suleyman, and Phil Blunsom. Teaching machines to Mundus Joint International Doctoral (Ph.D.) programme in read and comprehend. In Advances in Neural Law, Science and Technology. Luigi Di Caro and Guido Information Processing Systems, pages 1693–1701, Boella have received funding from the European Union’s 2015. H2020 research and innovation programme under the grant [13] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason agreement No 690974 for the project ”MIREL: MIning and Weston. The goldilocks principle: Reading children’s REasoning with Legal texts”. The authors would like to books with explicit memory representations. arXiv thank the anonymous reviewers who have suggested ways to preprint arXiv:1511.02301, 2015. improve the quality of the paper. [14] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 8. REFERENCES 9(8):1735–1780, 1997. [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua [15] Mohit Iyyer, Jordan L Boyd-Graber, Leonardo Bengio. Neural machine translation by jointly learning Max Batista Claudino, Richard Socher, and Hal to align and translate. arXiv preprint Daumé III. A neural network for factoid question arXiv:1409.0473, 2014. answering over paragraphs. In EMNLP, pages [2] Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, 633–644, 2014. Abby Vander Linden, Brittany Harding, Brad Huang, [16] Adebayo Kolawole John, Luigi Di Caro, Guido Boella, Peter Clark, and Christopher D Manning. Modeling and Cesare Bartolini. An approach to information biological processes for reading comprehension. In retrieval and question answering in the legal domain. EMNLP, 2014. [17] Mi-Young Kim, Ying Xu, and Randy Goebel. A [3] Antoine Bordes, Nicolas Usunier, Sumit Chopra, and convolutional neural network in legal question Jason Weston. Large-scale simple question answering answering. with memory networks. arXiv preprint [18] Mi-Young Kim, Ying Xu, and Randy Goebel. Legal arXiv:1506.02075, 2015. question answering using ranking svm and [4] Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and syntactic/semantic similarity. In JSAI International Hang Li. Learning to rank: from pairwise approach to Symposium on Artificial Intelligence, pages 244–258. listwise approach. In Proceedings of the 24th Springer, 2014. international conference on Machine learning, pages [19] Ankit Kumar, Ozan Irsoy, Jonathan Su, James 129–136. ACM, 2007. Bradbury, Robert English, Brian Pierce, Peter [5] Ido Dagan, Oren Glickman, and Bernardo Magnini. Ondruska, Ishaan Gulrajani, and Richard Socher. Ask The pascal recognising textual entailment challenge. me anything: Dynamic memory networks for natural In Machine learning challenges. evaluating predictive language processing. pages 0–6. uncertainty, visual object classification, and [20] LR Medsker and LC Jain. Recurrent neural networks. recognising tectual entailment, pages 177–190. Design and Applications, 2001. Springer, 2006. [21] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey [6] Oren Etzioni, Anthony Fader, Janara Christensen, Dean. Efficient estimation of word representations in Stephen Soderland, and Mausam Mausam. Open vector space. arXiv preprint arXiv:1301.3781, 2013. information extraction: The second generation. In [22] Ankur P Parikh, Oscar Täckström, Dipanjan Das, and IJCAI, volume 11, pages 3–10, 2011. Jakob Uszkoreit. A decomposable attention model for [7] Anthony Fader, Stephen Soderland, and Oren Etzioni. natural language inference. arXiv preprint Identifying relations for open information extraction. arXiv:1606.01933, 2016. In Proceedings of the Conference on Empirical Methods [23] Jeffrey Pennington, Richard Socher, and in Natural Language Processing, pages 1535–1545. Christopher D Manning. Glove: Global vectors for Association for Computational Linguistics, 2011. word representation. In EMNLP, volume 14, pages [8] Laurene V Fausett. Fundamentals of neural networks. 1532–43, 2014. Prentice-Hall, 1994. [24] Jürgen Schmidhuber. Deep learning in neural [9] Biralatei Fawei, Adam Z. Wyner, and Jeff Z. Pan. networks: An overview. Neural networks, 61:85–117, Passing a USA national bar exam - a first experiment. 2015. [25] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014. [26] Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. End-to-end memory networks. In Advances in neural information processing systems, pages 2440–2448, 2015. [27] Harry Surden. Machine learning and law. 2014. [28] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015. [29] Oanh Thi Tran, Bach Xuan Ngo, Minh Le Nguyen, and Akira Shimazu. Answering legal questions by mining reference information. In JSAI International Symposium on Artificial Intelligence, pages 214–229. Springer, 2013. [30] Peter D Turney, Patrick Pantel, et al. From frequency to meaning: Vector space models of semantics. Journal of artificial intelligence research, 37(1):141–188, 2010. [31] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards ai-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015. [32] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014. [33] Adam Wyner and Wim Peters. On rule extraction from regulations. In JURIX, volume 11, pages 113–122. Citeseer, 2011. [34] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT, pages 1480–1489, 2016. [35] Wenpeng Yin, Sebastian Ebert, and Hinrich Schütze. Attention-based convolutional neural network for machine comprehension. arXiv preprint arXiv:1602.04341, 2016.