Dialog Acts Classification for Question-Answer Corpora

Saurabh Chakravarty (saurabc@vt.edu), Raja Venkata Satya Phanindra Chava (chrvsp96@vt.edu), and Edward A. Fox (fox@vt.edu)
Virginia Tech, Blacksburg, VA

ABSTRACT
Many documents are constituted by a sequence of question-answer (QA) pairs. Applying existing natural language processing (NLP) methods such as automatic summarization to such documents leads to poor results. Accordingly, we have developed classification methods based on dialog acts to facilitate subsequent application of NLP techniques. This paper describes the ontology of dialog acts we have devised through a case study of a corpus of legal depositions that are made of QA pairs, as well as our development of machine/deep learning classifiers to identify dialog acts in such corpora. We have adapted state-of-the-art text classification methods based on a convolutional neural network (CNN) and long short term memory (LSTM) to classify the questions and answers into their respective dialog acts. We have also used pre-trained BERT embeddings for one of our classifiers. Experimentation showed we could achieve an F1 score of 0.84 on dialog act classification involving 20 classes. Given such promising techniques to classify questions and answers into dialog acts, we plan to develop custom methods for each dialog act, to transform each QA pair into a form that would allow for the application of NLP or deep learning techniques for other downstream tasks, such as summarization.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. © 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org

1 INTRODUCTION
Documents such as legal depositions contain conversations between a set of two or more people, aimed at identifying observations and the facts of a case. The conversational actors are aware of the current context, so need not include important contextual clues during their communication. Further, because of that awareness, their conversations may exhibit frequent context shifts.

These conversations are in the form of rapid-fire question-answer (QA) pairs. Like many conversations, these documents are noisy, only loosely following grammatical rules. Often, people don't speak using complete or well-formed sentences that can be comprehended in isolation. There are instances where a legal document is transcribed by a court reporter and the conversation contains words like "um" or "uh" that signify that the speaker is thinking. In many instances, there is an interruption that leads to incomplete sentences being captured or a sentence being abandoned altogether.

These characteristics of QA conversations make it difficult to apply popular NLP methods, including co-reference resolution and summarization techniques. For example, there is the challenge of identifying key concepts using NLP based rules. In many corpora, the root words that are most prevalent in sentences help identify the core concepts present in a document. These core concepts help text processing systems capture information with high precision. However, traditional NLP techniques like syntax parsing or dependency trees sometimes struggle to find the root of conversational sentences because of their form.

Humans, on the other hand, readily understand such documents since the number of types of questions and answers is limited, and these types provide strong semantic clues that aid comprehension. Accordingly, we seek to leverage the types found, to aid textual analysis.

Defining and identifying each QA pair type would ease the processing of the text, which in turn would facilitate downstream tasks like question answering, summarization, information retrieval, and knowledge graph generation. This is because special rules could be applied to each type of question and answer, allowing conversion oriented to supporting existing NLP tools. This would facilitate text parsing techniques like constituency and dependency parsing, and also enable us to break the text into different chunks based on part of speech (POS) tags.

Dialog Acts (DA) [19, 41] represent the communicative intention behind a speaker's utterance in a conversation. Identifying the DA of each speaker utterance in a conversation thus is a key first step in automatically determining intent and meaning. Specific rules can be developed for each DA type to process a conversation QA pair and transform it into a suitable form for subsequent analysis. Developing methods to classify the DAs in a conversation thus would help us delegate the transformation task to the right transformer method.

Text classification using deep learning techniques has rapidly improved in recent years. Deep neural network based architectures like the Recurrent Neural Network (RNN) [13], Long Short Term Memory (LSTM) [17], and Convolutional Neural Network (CNN) [16] now outperform traditional machine learning based text classification systems. For example, LSTM and CNN networks help capture the semantic and syntactic context of a word, which enables systems based on them to model word sequences better. There have been various architectures in the area of text classification which use an encoder-decoder [8] based model for learning. Systems using CNNs [2, 7, 10, 22, 34] or LSTMs [7, 39] have had significant performance improvements over the previously established baselines in text classification tasks like sentiment classification, machine translation, information retrieval, and polarity detection. Accordingly, we focus on deep learning based text classification techniques and fine-tune them for our task of DA classification.

The core contributions of this paper are as follows.
(1) A Dialog Act ontology that pertains to conversations in the legal domain.
(2) An annotated dataset that will be available for the research community.
(3) Classification methods that use state-of-the-art techniques to classify Dialog Acts, and which have been fine-tuned for this specific task.
2 RELATED WORK
Early work on Dialog Act classification [1, 14, 18, 23, 25, 28, 38, 40] used machine learning techniques such as Support Vector Machines (SVM), Deep Belief Networks (DBN), Hidden Markov Models (HMM), and Conditional Random Fields (CRF). They used features like speaker interaction and prosodic cues, as well as lexical, syntactic, and semantic features, for their models. Some of the works also included context features that were sourced from the previous sentences. Work in [36, 38] used an HMM for modeling the dialog act probabilities with words as observations, where the context was defined using the probabilities of the previous utterance dialog acts. Work in [12, 18] used a DBN for decoding the DA sequences and used both the generative and the conditional modeling approaches to label the dialog acts. Work in [6, 12, 21, 32] used CRFs to label the sequences of dialog acts.

The sentences in the QA pairs need to be modeled into a vector representation so that we can use them as features for text classification. The availability of rich word embeddings like word2vec [30] and GloVe [31] has been effective in text classification tasks. These embeddings are learned from large text corpora like Google News or Wikipedia. They are generated by training a neural network on the text, where the objective is to maximize the probability of a word given its context, or vice versa. This objective helps the neural network group words that are similar in a high-dimensional vector space. Work based on averaging the word vectors [5] in a sentence has given good performance in text classification.

In late 2018, Google developed BERT (Bidirectional Encoder Representations from Transformers) [11], a powerful method for sentence embeddings. It was pre-trained on a massive corpus of unlabeled data to build a neural network based language model. This allows BERT to achieve significantly higher performance on classification tasks which have a small task-specific dataset. The authors argued that existing deep learning based language models for generating embeddings are unidirectional, which poses challenges when we need to model sentences. Tasks such as attention based question answering require the architecture to attend to tokens before and after, during the self-attention stage. The core contribution was the generation of pre-trained sentence embeddings that were learned using the left and right context of each token in the sentence. The authors also proposed that these pre-trained embeddings can be used to model any custom NLP task by adding a final fully connected neural network layer and fitting the network output to the task at hand; there is no need to create a complex network architecture. BERT internally uses the multi-layer network or "transformer" presented in [37] to model the input text and the output embedding. The transformer involves six layers of attention, each followed by normalization and a feed-forward layer, as an encoder, and the same layers plus an added masked attention layer for the decoder. The attention layers in the encoder and decoder build self-attention on the input and output words, respectively, to learn which words are important. The masked attention layer in the decoder attends only to the output tokens that have already been generated by the decoder so far. To train the model, the work involved learning on two tasks. The first task was to guess a masked word in a sentence, where each sentence was from a large corpus; the authors removed a word randomly from a sentence and trained the model to predict the right word. The second task was to predict the following sentence for a given sentence, from a choice of four sentences. The training was performed using the Google Books Corpus (with 800M words) [27] and English Wikipedia (with 2,500M words) [9]. The work obtained new state-of-the-art results on 11 NLP tasks as part of the General Language Understanding Evaluation (GLUE) benchmark, and was very competitive in other tasks.

Recent works like [20, 26, 33] use deep neural networks to classify dialog acts. These works used models like CNNs and LSTMs to model the context of a sentence. Work in [20] used a CNN+LSTM model for the DA classification and slot-filling tasks using two different datasets; they obtained a negligible improvement for one of the datasets and a significant improvement for the other. Work in [33] used a recurrent CNN based model to classify the DAs, and obtained a 2.9% improvement over the LM-HMM baseline. Work in [26] used RNN and CNN based models for DA classification, along with the DA labels of the previous utterances, to achieve state-of-the-art results on the DA classification task.

3 METHODS
As part of our methods, we defined an ontology of dialog acts for the legal domain. Each sentence in the conversation was classified into one of the classes. The following sections describe the ontology and classification methods in more detail.

3.1 Dialog Act Ontology
After a thorough analysis of the conversation QA pairs in our dataset of depositions, two researchers refined a subset of the dialog acts found in [19]. These researchers also added additional dialog acts to our ontology for the questions and answers, again based on their analysis of the depositions. The following sections present more details.

3.1.1 Question specific dialog acts. Table 1 shows the different dialog acts that we have defined for the questions in the depositions. We expanded the "wh" category, which covers many of the DAs in a deposition, into sub-categories. This enables specific comprehension techniques to be used on each sub-category, since the sentences vary across sub-categories. Table 2 lists and describes each sub-category of the "wh" parent category.

| Category | Description | Example |
| wh | A wh-* question. These questions generally start with question words like who, what, where, when, why, how, etc. | What time did you wake up on the morning the incident took place? |
| wh-d | Also a wh-* question, but with more than one statement: a what-declarative question. These questions include some information prior to the actual question which relates to the question. | You said you generally wake up at 7:00 am. But what time did you wake up on the morning the incident took place? |
| bin | A binary question, which can be answered with a simple "yes" or "no". | Is that where you live? |
| bin-d | A binary-declarative question, which can also be answered with a "yes" or a "no". But in a binary-declarative question, the person who asks the question knows the answer and asks for verification. In contrast, a binary question indicates the examiner seeks to learn the actual answer. | That is where you live, right? |
| qo | An open question. These are general questions which are not specific to any context, asked to learn the opinions of the person who is answering. | Do you think Mr. Pace made a good decision? |
| or | A choice question. Choice questions offer a choice of several options as an answer. They are made up of two parts, connected by the conjunction "or". | Were you working out for fun or were you into body building? |
Table 1: Question dialog acts

| Category | Description | Example |
| num | A what question specific to numeric quantities. | What is the age of your daughter? |
| hum | A what question specific to human beings. | What is the name of your daughter? |
| loc | A what question specific to locations. | What is the name of the city where your daughter lives? |
| ent | A what question specific to other entities. | What is the email address of your daughter? |
| des | A what question asking for a description. | What were you doing there at that point of time? |
Table 2: wh-question dialog acts

3.1.2 Answer specific dialog acts. Table 3 shows the different dialog acts that we have defined for the answers in the depositions.

| Category | Description | Example |
| y | The answer means yes. The answer sentence can take various forms and need not be exactly "yes". | "yes", "yeah", "Of course", "definitely it is", "that's right", "I am sure", etc. |
| y-d | The person answering the binary question not only says yes but also gives an explanation for the answer. | Yes. I play badminton because my doctor advised me to. |
| y-followup | The answer is yes, but the answer contains another question which pertains to the question asked. | Yes I have seen them. But what do you mean by inside the elevator? |
| n | The answer means no. Again, the answer need not be exactly "no". | "No", "I don't think so", "certainly not", "I am afraid not", etc. |
| n-d | The person answering the binary question not only says no but also gives an explanation for the answer. | No. I am not interested in playing cricket because it takes a lot of time. |
| n-followup | The answer is no, but the answer contains another question which pertains to the question asked. | That is not me. Do you think that is me? |
| sno | A statement which is a non-opinion. This is an informative statement made by the person answering the question. | I retired from my job in 2010. |
| so | A statement which is an opinion of the person answering, rather than a general statement. | I believe retiring from my job was the best decision I made. |
| ack | A response which indicates acknowledgment. | "Okay", "Um-hum", "I see", etc. |
| dno | A response given when the person doesn't know, doesn't recall, or is unsure about the answer to the question asked. | I don't recall what happened that day. |
| confront | The answer contains no information; it is a confrontation by the deponent in response to the question asked. | So do you say that I have given you the wrong information? |
Table 3: Answer dialog acts

3.2 Dialog Act Classification
We used different classifiers based on deep learning that have achieved state-of-the-art results on multiple other tasks. We also used simple classifiers consisting of sentence embeddings followed by a fully connected neural network, to check the efficacy of sentence embeddings like BERT in dialog act classification. The following sections describe the different classification methods we used to classify the dialog acts.

3.2.1 Classification using CNN. Work in [22] used a CNN to capture the n-gram representation of a sentence using convolution. A window size, provided as a parameter, defines the number of words included in each convolution filter. Figure 1 shows the convolution operation capturing a bi-gram representation.

[Figure 1: An n-gram convolution filter [15, 22].]

We used the architecture from the original work in [22] for learning the sentence representation using a CNN. We added a feed-forward neural network layer on top of the representation layer to finally classify the dialog act for a given sentence. Tokens from a sentence are transformed into word vectors using word2vec and fed into the network. This is followed by the convolution and max-pooling operations. The final sentence representation has a fixed size, irrespective of sentence length. As the system trains, the network learns a sentence embedding as part of this layer. This representation is rich since it captures the semantic and syntactic relations between the words. Figure 2 shows a reference architecture of the whole CNN based approach for two classes.

[Figure 2: CNN based classifier architecture [4, 22].]
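To make this concrete, the following is a minimal Keras sketch of such a Kim-style CNN sentence classifier. It is an illustration under our own assumptions, not the authors' code: the vocabulary size is a placeholder, the hidden layer size, dropout, and activation loosely follow the best values later reported in Table 6, and in practice the embedding layer would be initialized with the pre-trained word2vec vectors.

```python
from tensorflow.keras import layers, models

# Illustrative sizes; the paper tunes these (Section 5.1.2, Table 6).
VOCAB_SIZE, EMB_DIM, MAX_LEN, NUM_CLASSES = 20000, 300, 32, 20

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
# In practice, weights here would come from pre-trained word2vec vectors.
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)

# Parallel convolution filters over uni-, bi-, and tri-gram windows,
# each followed by max-over-time pooling, as in Kim's architecture [22].
pooled = []
for window in (1, 2, 3):
    conv = layers.Conv1D(filters=100, kernel_size=window, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))
sentence_vec = layers.Concatenate()(pooled)  # fixed-size sentence representation

x = layers.Dropout(0.5)(sentence_vec)
x = layers.Dense(200, activation="sigmoid")(x)  # hidden size/activation per Table 6
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Max-over-time pooling is what makes the sentence representation fixed-size regardless of sentence length, as described above.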
3.2.2 Classification using LSTM with attention. Work in [42] used a bi-directional LSTM with an attention mechanism to capture the most important information contained in a sentence. It did not use any features from classical NLP systems. Even though a CNN can capture some semantic and syntactic dependencies between words using a larger feature map, it struggles to capture long term dependencies between words when sentences are long. LSTM based network architectures are better equipped to capture these long term dependencies since they employ a recurrent model. The context of the initial words can make its way down the recurrent chain, based on the activations of those words and their gradients during the back-propagation phase.

Figure 3 shows the network architecture of the system. The words are fed into the network using their vector representations. The network processes the words in both directions. This helps the network learn semantic information not only from the words in the past, but also from the words in the future. The output layers of the two directional LSTMs are combined into one using an element-wise sum. An attention layer is added to this combined output, with a coefficient for each output unit. These coefficients act as the attention mechanism; attention priorities are learned by the system during the training phase, and the coefficients capture the relative importance of the terms in the input sentence. The word embeddings were also learned as part of the training. Dropout [35] was applied to the embedding, LSTM, and penultimate layers. L2-norm based penalties were also applied as part of the regularization.

[Figure 3: Bi-directional LSTM with attention architecture [42].]
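A minimal Keras sketch of this architecture follows. It is our reconstruction rather than the authors' code: sizes follow the best values in Table 7, the attention scoring is a common simplified variant of the mechanism in [42], and the L2-norm penalties mentioned above are omitted for brevity.

```python
from tensorflow.keras import layers, models, optimizers

VOCAB_SIZE, EMB_DIM, HIDDEN, MAX_LEN, NUM_CLASSES = 20000, 256, 128, 32, 20

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)  # embeddings learned in training
x = layers.Dropout(0.5)(x)                         # dropout on the embedding layer

# Forward and backward LSTM outputs are merged with an element-wise sum.
h = layers.Bidirectional(layers.LSTM(HIDDEN, return_sequences=True),
                         merge_mode="sum")(x)
h = layers.Dropout(0.5)(h)                         # dropout on the LSTM layer

# Attention: one learned coefficient per time step, normalized with softmax;
# the sentence vector is the attention-weighted sum of the LSTM outputs.
scores = layers.Dense(1, activation="tanh")(h)     # (batch, MAX_LEN, 1)
weights = layers.Softmax(axis=1)(scores)           # attention coefficients
sentence_vec = layers.Flatten()(layers.Dot(axes=1)([weights, h]))

out = layers.Dropout(0.5)(sentence_vec)            # dropout on penultimate layer
out = layers.Dense(NUM_CLASSES, activation="softmax")(out)

model = models.Model(inputs, out)
model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```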
3.2.3 Classification using BERT. In this method, we generate the sentence embeddings of the questions and answers via the BERT pre-trained model. BERT can be fine-tuned to any NLP task by adding a layer on top of its architecture that makes it suitable for the task. Figure 4 shows the high-level architecture, consisting of components like embeddings and transformers.

[Figure 4: BERT architecture [11, 37].]

In our system implementation, we used the BERT reference architecture and added a feed-forward neural network layer on top of the BERT sentence embeddings. We want to classify text whose length varies from roughly a portion of one sentence to a large paragraph. Further, we are performing single sentence classification, not the sentence pair classification also described in the BERT paper. We use the BERT-Base, Cased pre-trained model for our classification experiments. Figure 5 shows the architecture of our classifier.

[Figure 5: BERT single sentence classification architecture [11].]

In our experiments section, we will refer to these classification methods as CNN, Bi-LSTM, and BERT, respectively.
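As an illustration, the sketch below fine-tunes BERT-Base, Cased for single sentence classification. It uses the Hugging Face transformers library as a stand-in for the BERT reference implementation, and the example utterances and label indices are hypothetical.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# BERT-Base, Cased, as in the paper; the library supplies the final fully
# connected classification layer over the pooled sentence representation.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForSequenceClassification.from_pretrained("bert-base-cased",
                                                      num_labels=20)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr as in Table 8

# One illustrative training step on two hypothetical utterances.
batch = tokenizer(["Is that where you live?",
                   "I retired from my job in 2010."],
                  padding="max_length", truncation=True, max_length=32,
                  return_tensors="pt")
labels = torch.tensor([2, 13])  # hypothetical indices for "bin" and "sno"

outputs = model(**batch, labels=labels)  # forward pass returns loss and logits
optimizer.zero_grad()
outputs.loss.backward()   # fine-tunes BERT and the added layer end-to-end
optimizer.step()
```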
4 DATASET
Legal depositions and trial testimonies represent a type of conversation with a specific format, where the attorney asks questions and the deponent or witness answers those questions. Figure 6 shows an example page of a legal deposition. Proper parsing of legal depositions is necessary to perform analysis for downstream tasks like summarization.

[Figure 6: Example page of a deposition.]

4.1 Proprietary Dataset
For our dialog act classification experiments, we performed all our work on a proprietary dataset, provided by Mayfair Group LLC. This dataset was made available to us as a courtesy by several law firms. Our classification experiments were performed on this dataset, and the results of this paper reflect the same. The dataset consists of around 350 depositions, whose format follows conventional legal deposition standards.

4.2 Tobacco Dataset
The roughly 14 million Truth Tobacco Industry Documents constitute a public dataset which contains legal documents related to the settlement of court cases between US states and the seven major tobacco industry organizations, concerning willful actions of tobacco companies to sell tobacco products despite their knowledge of the harmful effects. It was created in 2002 by the UCSF Library and Center for Knowledge Management to provide public access to the many legal documents related to that settlement. This dataset includes around 12,300 publicly available legal deposition documents, which can be accessed from the website maintained by UCSF [24]. Our analysis and results can also be reproduced on this publicly available dataset.

Due to client privacy and confidentiality concerns, we are unable to share the proprietary dataset. The annotated tobacco dataset is available publicly for the research community to use; it can be downloaded from https://github.com/saurabhc123/asail_dataset

4.3 Data pre-processing
Legal depositions can be in a wide variety of formats, like .pdf, .docx, .rtf, .txt, etc. Implementing separate functionality for parsing each format would be difficult and time-consuming, so a common platform that can parse deposition transcripts across all the formats in a generalized way is needed. Apache Tika [29], developed by the Apache Software Foundation, can extract metadata and content from hundreds of file types through a single interface. Apache Tika has Python support through a library called tika.

Though there is a standard format for deposition documents, different challenges were encountered while parsing them. Challenges faced in legal deposition document parsing include:
(1) Varying number of columns per page,
(2) Header and footer elimination, and
(3) Determining the starting and ending points of the actual deposition conversation within the entire document.

Generally, the PDF versions of legal depositions have multiple columns per page. Apache Tika reads multiple columns in a page separately by recognizing column separations, which are encoded as extended ASCII codes. Hence, text from separate columns is parsed in the correct sequence.

Headers and footers in legal depositions contain several things, such as the name of the person being deposed, the name of the attorney, the name of the law firm, e-mail IDs, phone numbers, page numbers, etc. Figure 6 shows an example of a page in a legal deposition with a header and footer. We read the content parsed by Apache Tika line by line and use regular expressions (regex) in Python to search for a pattern within each line of the text. Using regex, we convert every line to a string which contains only alphabetic characters, periods, and question marks. Then, we use a Python dictionary to store all the patterns and the list of indices of the lines in which each pattern has appeared. Finally, we check for the patterns which satisfy the constraints below and remove the corresponding lines from the text.
(1) The number of times these patterns appear must be greater than or equal to the number of pages of the document.
(2) Those lines must not begin with the answer or question tags ('A.' and 'Q.') and must not end with a question mark.

For example, in the document represented by Figure 6, the patterns "sourcehttpswww.industrydocuments.ucsf.edudocspsmw", "january", "jamesfiglar", and "u.s.legalsupport" satisfy all of the above constraints, and hence the lines containing these patterns are removed from the entire text with the help of their indices, which are stored in the dictionary.

After cleaning the text, pre-processing had to be done to extract the needed data in the required format. A deposition transcript can contain multiple segments (like "INDEX", "EXHIBITS", "APPEARANCES", "EXAMINATION", "STIPULATIONS", "CERTIFICATIONS", etc.). For our work, we only needed the "EXAMINATION" segment, where the actual conversation between attorney(s) and deponent takes place. Figures 7 and 8 show the beginning and ending of the "EXAMINATION" segment. We extract the "EXAMINATION" segment based on the observed patterns that represent the beginning and ending of this segment, which hold across our various depositions.

[Figure 7: Example of beginning of "EXAMINATION" segment.]
[Figure 8: Example of ending of "EXAMINATION" segment.]

Finally, our pre-processing methods removed the noise from the text and extracted only the conversation part of the deposition.
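The sketch below illustrates the header/footer heuristic just described, assuming the tika Python library; the function name and the num_pages argument are ours, and extraction of the "EXAMINATION" segment would follow as a separate step.

```python
import re
from collections import defaultdict
from tika import parser  # the tika Python library mentioned above

def strip_headers_footers(path, num_pages):
    """A minimal sketch of the pattern-counting heuristic of Section 4.3."""
    lines = parser.from_file(path)["content"].splitlines()

    # Normalize each line to letters, periods, and question marks, and
    # record the indices of the lines where each pattern appears.
    occurrences = defaultdict(list)
    for i, line in enumerate(lines):
        pattern = re.sub(r"[^A-Za-z.?]", "", line).lower()
        if pattern:
            occurrences[pattern].append(i)

    # A pattern marks header/footer noise if it appears at least once per
    # page, does not begin with the 'A.'/'Q.' tags, and is not a question.
    noise = set()
    for pattern, indices in occurrences.items():
        if (len(indices) >= num_pages
                and not pattern.startswith(("a.", "q."))
                and not pattern.endswith("?")):
            noise.update(indices)

    return [line for i, line in enumerate(lines) if i not in noise]
```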
5 EXPERIMENTAL SETUP AND RESULTS

5.1 Experimental Setup
The overall size of the derived dataset developed from the public dataset for dialog act classification was about 2500 questions and answers in total. This entire dataset was manually annotated, to provide a ground truth for evaluation. The dataset was then randomly divided into train, validation, and test datasets in the ratio 70:20:10, respectively, to be studied using each of the three classifiers. Table 4 shows the distribution of the classes over the whole dataset.

| Class | Count | % of Total |
| ack | 36 | 1.46 |
| bin | 437 | 17.67 |
| bin-d | 369 | 14.92 |
| cc | 0 | 0.00 |
| co | 4 | 0.16 |
| confront | 21 | 0.85 |
| dno | 142 | 5.74 |
| n | 76 | 3.07 |
| n-d | 74 | 2.99 |
| n-followup | 1 | 0.04 |
| nu | 29 | 1.17 |
| or | 18 | 0.73 |
| qo | 25 | 1.01 |
| sno | 567 | 22.93 |
| so | 25 | 1.01 |
| wh | 298 | 12.05 |
| wh-d | 57 | 2.30 |
| y | 226 | 9.14 |
| y-d | 66 | 2.67 |
| y-followup | 2 | 0.08 |
| Total | 2473 | - |
Table 4: Class distribution for the dataset

5.1.1 Environment setup. All the classification experiments were run on a Dell server running Ubuntu 16.04, with 32 GB RAM and two NVIDIA Tesla P40 GPUs.

5.1.2 CNN classifier. Parameters that were fine-tuned for the CNN with word2vec embeddings classifier are:
(1) hidden layer size: varied from 100 to 500 in steps of 100.
(2) dropout: varied from 0.1 to 0.5 in steps of 0.1.
(3) output layer activation function: sigmoid, tanh, and relu.
(4) n-gram: window size based on unigram, bi-gram, and tri-gram groupings.
(5) max-sequence length: kept constant at 32.
(6) batch-size: kept constant at 100.
(7) number of epochs: varied from 10 to 50, until the validation accuracy stopped improving.

5.1.3 LSTM classifier. Parameters that were fine-tuned for the Bi-directional LSTM with attention classifier are:
(1) hidden layer size: varied among 32, 64, 128, and 256.
(2) embedding size: varied among 32, 64, 128, and 256.
(3) learning rate: varied among 0.0001, 0.001, 0.01, and 0.1.
(4) max-sequence length: kept constant at 32.
(5) batch-size: kept constant at 100.
(6) number of epochs: varied from 10 to 50, until the validation accuracy stopped improving.

5.1.4 BERT classifier. Parameters that were fine-tuned for the BERT single sentence classifier are:
(1) learning rate: varied among 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, and 0.1.
(2) max-sequence length: kept constant at 32.
(3) batch-size: kept constant at 100.
(4) number of epochs: varied from 10 to 50, until the validation accuracy stopped improving.
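This tuning amounts to a search over each classifier's parameter grid, selecting the configuration with the best validation score. A minimal sketch follows, shown for the CNN grid of Section 5.1.2; train_and_validate() is a hypothetical stand-in for one training run, not a function from our codebase.

```python
import itertools

def train_and_validate(hidden_size, dropout, activation, ngram):
    """Hypothetical stand-in: build the Section 3.2.1 CNN with these
    parameters, train it, and return the validation F1 score."""
    return 0.0  # placeholder; a real run would train and evaluate here

# Grid over the CNN parameters varied in Section 5.1.2; the other two
# classifiers were tuned analogously over their own grids.
grid = {
    "hidden_size": [100, 200, 300, 400, 500],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5],
    "activation": ["sigmoid", "tanh", "relu"],
    "ngram": [1, 2, 3],
}

best_f1, best_params = -1.0, None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid, values))
    f1 = train_and_validate(**params)
    if f1 > best_f1:
        best_f1, best_params = f1, params
```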
5.2 Results
5.2.1 System Comparisons. Table 5 lists each of the three classifiers and its best test F1 score. BERT outperformed the other methods by a significant margin, achieving an F1 score of 0.84.

| Classifier | F1-score |
| BERT | 0.84 |
| CNN | 0.57 |
| LSTM | 0.71 |
Table 5: Classifiers and their F1 scores

Tables 6, 7, and 8 give the parameters of the CNN, LSTM, and BERT classifiers, respectively, with which the best results were achieved.

| Parameter | Value |
| hidden layer size | 200 |
| dropout | 0.5 |
| output layer activation function | sigmoid |
| n-gram | trigram |
| max-sequence length | 32 |
| batch-size | 100 |
| number of epochs | 30 |
Table 6: Best fine-tuned parameters for the CNN classifier on the Tobacco dataset

| Parameter | Value |
| hidden layer size | 128 |
| embedding size | 256 |
| learning rate | 0.01 |
| max-sequence length | 32 |
| batch-size | 100 |
| number of epochs | 30 |
Table 7: Best fine-tuned parameters for the LSTM classifier on the Tobacco dataset

| Parameter | Value |
| learning rate | 2e-5 |
| max-sequence length | 32 |
| batch-size | 100 |
| number of epochs | 30 |
Table 8: Best fine-tuned parameters for the BERT classifier on the Tobacco dataset

Figures 9, 10, and 11 show train and test accuracy across the number of epochs for the CNN, LSTM, and BERT classifiers, respectively.

[Figure 9: Train & test accuracy vs. epochs for CNN]
[Figure 10: Train & test accuracy vs. epochs for LSTM]
[Figure 11: Train & test accuracy vs. epochs for BERT]

We observe from Figures 9, 10, and 11 that after 15 epochs, the training accuracy is still increasing but the validation accuracy remains almost constant. This indicates that after 15 epochs, the models achieve a good fit. We also observe that the validation accuracy of BERT is the highest, compared to the CNN and LSTM classifiers, reaching around 83%. This is another indicator that the BERT classifier is best suited for dialog act classification of legal depositions, as compared to the CNN and LSTM classifiers.

Tables 9, 10, and 11 show the precision, recall, and F1 scores for the CNN, LSTM, and BERT classifiers, respectively.

| Class | Precision | Recall | F1-score |
| ack | 1.00 | 0.67 | 0.80 |
| bin | 0.48 | 0.48 | 0.48 |
| bin-d | 0.75 | 0.71 | 0.73 |
| confront | 0.00 | 0.00 | 0.00 |
| dno | 0.62 | 0.50 | 0.55 |
| n | 1.00 | 0.70 | 0.82 |
| n-d | 1.00 | 0.40 | 0.57 |
| nu | 0.00 | 0.00 | 0.00 |
| or | 0.00 | 0.00 | 0.00 |
| qo | 0.00 | 0.00 | 0.00 |
| sno | 0.51 | 0.80 | 0.62 |
| so | 0.00 | 0.00 | 0.00 |
| wh | 0.44 | 0.50 | 0.47 |
| wh-d | 0.00 | 0.00 | 0.00 |
| y | 0.84 | 0.91 | 0.87 |
| y-d | 1.00 | 0.67 | 0.80 |
| avg / total | 0.57 | 0.60 | 0.57 |
Table 9: Classification scores for CNN

| Class | Precision | Recall | F1-score |
| ack | 0.86 | 1.00 | 0.92 |
| bin | 0.79 | 0.73 | 0.76 |
| bin-d | 0.67 | 0.83 | 0.74 |
| confront | 0.50 | 0.50 | 0.50 |
| dno | 0.82 | 0.75 | 0.78 |
| n | 1.00 | 0.75 | 0.86 |
| n-d | 0.80 | 0.80 | 0.80 |
| nu | 0.00 | 0.00 | 0.00 |
| or | 0.00 | 0.00 | 0.00 |
| qo | 0.00 | 0.00 | 0.00 |
| sno | 0.61 | 0.74 | 0.67 |
| so | 0.50 | 0.25 | 0.33 |
| wh | 0.88 | 0.82 | 0.85 |
| wh-d | 0.50 | 0.14 | 0.22 |
| y | 0.70 | 0.89 | 0.78 |
| y-d | 1.00 | 0.55 | 0.71 |
| avg / total | 0.72 | 0.72 | 0.71 |
Table 10: Classification scores for LSTM

| Class | Precision | Recall | F1-score |
| ack | 1.00 | 1.00 | 1.00 |
| bin | 0.78 | 0.93 | 0.85 |
| bin-d | 0.74 | 0.74 | 0.74 |
| confront | 1.00 | 0.50 | 0.67 |
| dno | 0.93 | 1.00 | 0.97 |
| n | 0.89 | 1.00 | 0.94 |
| n-d | 0.86 | 0.75 | 0.80 |
| nu | 0.33 | 1.00 | 0.50 |
| or | 0.00 | 0.00 | 0.00 |
| qo | 0.00 | 0.00 | 0.00 |
| sno | 0.91 | 0.84 | 0.87 |
| so | 0.50 | 1.00 | 0.67 |
| wh | 0.92 | 0.85 | 0.88 |
| wh-d | 0.62 | 0.56 | 0.59 |
| y | 0.92 | 1.00 | 0.96 |
| y-d | 0.92 | 0.86 | 0.89 |
| avg / total | 0.83 | 0.84 | 0.84 |
Table 11: Classification scores for BERT

5.3 Error Analysis
We chose the best performing classification results and performed a detailed error analysis of the misclassifications. Table 12 discusses the errors associated with each dialog act. We have not included the dialog acts that had fewer than 3 test samples or misclassifications.

| Class | Analysis |
| bin | Certain cases were classified as bin-d instead of bin. There is a very subtle difference between bin and bin-d, and the classifier sometimes struggles to detect this subtlety. There were a couple of instances where bin was classified as "wh". This happened because the question included the word "how" or "what" but was framed such that the response would be a yes or no, e.g., "Is Altria Group, Inc. what is considered to be a public company?". For such questions, the classifier cannot distinguish the exact difference and classifies based on the observed words. |
| bin-d | Most of the misclassifications for this dialog act were assignments to the "bin" category. The classifier takes cues from the words and sometimes fails to recognize that there is some context to the question before the actual question is asked. |
| confront | Most of the classifications of this kind were erroneous. This is due to the lack of training data for the classifier to effectively learn to distinguish the "confront" class from the other classes; there are very few instances of this class in the depositions. Adding more specific training data for this class would help increase the classification performance. |
| dno | The misclassifications for this class were due to the semantic formation of the sentence; there is very little for the classifier to distinguish from the other classes "n-d" and "n". |
| n-d | The few misclassifications for this class resulted from having the word "no" appended in addition to a response of an "n-d" kind. |
| qo | Lack of training data and very few distinguishing words for the classifier to make an accurate judgment. More training data for this class would help increase the classification performance. |
| sno | The misclassifications for this class were assignments to "bin-d" or "wh". On further analysis, it was observed that the misclassified sentences were very long, and some of them ended with a form that made the classifier assign them to the "bin-d" or "wh" classes. |
| so | There is very little to distinguish a "so" class from a "sno" class; most misclassifications were of this kind. We believe they can be merged into one single category as part of our future work. |
| wh | The misclassifications for this class involved assignment to the wh-d class. Looking at the statements, we can conclude that they could belong to the "wh-d" class. This was more an annotation error than a misclassification. |
| y-d | In the two misclassifications, one of the statements was too long and was assigned the "so" category. For the other instance, the presence of the word "yes" in the statement made it get assigned to the "y" category, even though a sentence preceded it. |
Table 12: Error analysis

6 CONCLUSION AND FUTURE WORK
We parsed legal depositions in a wide variety of formats and extracted the necessary conversation information, also removing much of the noise, allowing natural language processing (NLP) and deep learning techniques to be employed for further processing.

State-of-the-art summarization methods and NLP techniques are difficult to apply to question-answer pairs. Our preliminary testing with summarization methods applied to QA pairs led to poor results. Hence we desire a semantically equivalent, grammatically correct, and linguistically fluent representation to replace each QA pair. This should retain key information from the QA pair so that summaries generated from that representation do not lose any important information from the actual conversation. To achieve this, we carefully defined and developed a dialog act ontology which contains 20 dialog acts to capture the intention of the speaker behind each utterance. The quality of the set of dialog acts is also enriched based on our study of the legal deposition domain. Classification of each question and answer into these dialog acts should aid in developing specific NLP rules or techniques to convert each question-answer pair into an appropriate representation.
For classification purposes, we have created our own dataset by manually annotating around 2500 questions and answers with their corresponding dialog acts. This dataset helped us in training the classifiers and also in evaluating their performance. We have developed three deep learning based classification methods for dialog act classification:
• Convolutional Neural Network (CNN) with word2vec embeddings,
• Bi-directional Long Short Term Memory (LSTM) with attention mechanism, and
• Bidirectional Encoder Representations from Transformers (BERT).

We experimented with these three classifiers and fine-tuned their various parameters. We performed training, validation, and testing with each of the three classifiers. We achieved F1 scores of 0.57 and 0.71 using the CNN and the LSTM based classifiers, respectively. The highest F1 score of 0.84 was achieved using the BERT sentence embeddings based classifier on the dialog act classification task.

We plan to extend this work in the following ways.
(1) Use context information for dialog act classification, such as using the dialog acts from previous utterances [3] to classify the current dialog act, to improve the classification accuracy.
(2) Develop NLP and deep learning techniques to convert a question-answer pair to a semantically equivalent representation, to which it will be easy to apply a variety of NLP tools.
(3) Use state-of-the-art deep learning based abstractive summarization methods to generate summaries from those representations.
(4) Develop explainable AI methods so it will be clear how summaries were generated.

ACKNOWLEDGMENTS
This work was made possible by Virginia Tech's Digital Library Research Laboratory (DLRL). We would also like to thank Ashin Marin Thomas for her help with data annotation and running the experiments. Data in the form of legal depositions was provided by Mayfair Group LLC. In accordance with Virginia Tech policies and procedures and our ethical obligations as researchers, we report that Dr. Edward Fox has an equity interest in Mayfair Group, LLC, whose data was used in this research. Dr. Fox has disclosed those interests fully to Virginia Tech, and has in place an approved plan for managing any potential conflicts arising from this relationship.

REFERENCES
[1] Jeremy Ang, Yang Liu, and Elizabeth Shriberg. Automatic dialog act segmentation and classification in multiparty meetings. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 1, pages I-1061. IEEE, 2005.
[2] Phil Blunsom, Edward Grefenstette, and Nal Kalchbrenner. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. ACL, 2014.
[3] Chandrakant Bothe, Cornelius Weber, Sven Magg, and Stefan Wermter. A context-based approach for dialogue act recognition using simple recurrent neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
[4] Denny Britz. Understanding convolutional neural networks for NLP. http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/, 2015 (visited on 11/07/2015).
[5] Eduardo PS Castro, Saurabh Chakravarty, Eric Williamson, Denilson Alves Pereira, and Edward A Fox. Classifying short unstructured data using the Apache Spark platform. In Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries, pages 129–138. IEEE Press, 2017.
[6] Lin Chen and Barbara Di Eugenio. Multimodality and dialogue act classification in the RoboHelper project. In Proceedings of the SIGDIAL 2013 Conference, pages 183–192, 2013.
[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, 2014.
[8] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111, 2014.
[9] William Coster and David Kauchak. Simple English Wikipedia: a new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 665–669. Association for Computational Linguistics, 2011.
[10] Misha Denil, Alban Demiraj, Nal Kalchbrenner, Phil Blunsom, and Nando de Freitas. Modelling, visualising and summarising documents with a single convolutional neural network. arXiv preprint arXiv:1406.3830, 2014.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[12] Alfred Dielmann and Steve Renals. Recognition of dialogue acts in multiparty meetings using a switching DBN. IEEE Transactions on Audio, Speech, and Language Processing, 16(7):1303–1314, 2008.
[13] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[14] Raul Fernandez and Rosalind W Picard. Dialog act classification from prosodic features using support vector machines. In Speech Prosody 2002, International Conference, 2002.
[15] Yoav Goldberg. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1–309, 2017.
[16] Simon Haykin. Neural Networks, volume 2. Prentice Hall, New York, 1994.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[18] Gang Ji and Jeff Bilmes. Dialog act tagging using graphical models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), volume 1, pages I-33. IEEE, 2005.
[19] Daniel Jurafsky, Elizabeth Shriberg, Barbara Fox, and Traci Curl. Lexical, prosodic, and syntactic cues for dialog acts. Journal on Discourse Relations and Discourse Markers, 1998.
[20] Nal Kalchbrenner and Phil Blunsom. Recurrent convolutional neural networks for discourse compositionality. In Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality, pages 119–126, 2013.
[21] Su Nam Kim, Lawrence Cavedon, and Timothy Baldwin. Classifying dialogue acts in one-on-one live chats. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 862–871. Association for Computational Linguistics, 2010.
[22] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751. ACL, 2014.
[23] Pavel Král and Christophe Cerisara. Automatic dialogue act recognition with syntactic features. Language Resources and Evaluation, 48(3):419–441, 2014.
[24] UCSF Library and Center for Knowledge Management. Truth Tobacco Industry Documents, 2002. https://www.industrydocuments.ucsf.edu/tobacco.
[25] Yang Liu. Using SVM and error-correcting codes for multiclass dialog act classification in meeting corpus. In Ninth International Conference on Spoken Language Processing, 2006.
[26] Yang Liu, Kun Han, Zhao Tan, and Yun Lei. Using context information for dialog act classification in DNN framework. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2170–2178, 2017.
[27] Mark Davies. Google Books corpora, 2011. [Online; accessed 28-April-2019].
[28] Marion Mast, Ralf Kompe, Stefan Harbeck, Andreas Kießling, Heinrich Niemann, Elmar Nöth, Ernst Günter Schukat-Talamazzini, and Volker Warnke. Dialog act classification with the help of prosody. In Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP'96), volume 3, pages 1732–1735. IEEE, 1996.
[29] Chris Mattmann and Jukka Zitting. Tika in Action. Manning Publications Co., Greenwich, CT, USA, 2011.
[30] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[31] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
[32] Silvia Quarteroni, Alexei V Ivanov, and Giuseppe Riccardi. Simultaneous dialog act segmentation and classification from human-human spoken conversations. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5596–5599. IEEE, 2011.
[33] LM Rojas-Barahona, M Gašić, N Mrkšić, PH Su, S Ultes, TH Wen, and S Young. Exploiting sentence and context representations in deep neural models for spoken language understanding. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 258–267, 2016.
[34] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, pages 101–110. ACM, 2014.
[35] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[36] Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373, 2000.
[37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
[38] Anand Venkataraman, Luciana Ferrer, Andreas Stolcke, and Elizabeth Shriberg. Training a prosody-based dialog act tagger from unlabeled data. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), volume 1, pages I-I. IEEE, 2003.
[39] Xin Wang, Yuanchao Liu, Chengjie Sun, Baoxun Wang, and Xiaolong Wang. Predicting polarities of tweets by composing word embeddings with long short-term memory. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1343–1353, 2015.
[40] N Webb, M Hepple, and Y Wilks. Dialog act classification based on intra-utterance features. Technical Report CS-05-01, Dept. of Computer Science, University of Sheffield, UK, 2005.
[41] Jason Williams. A belief tracking challenge task for spoken dialog systems. In NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012), pages 23–24, 2012.
[42] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212, 2016.