An Examination of the Validity of General Word Embedding Models for Processing Japanese Legal Texts

Linyuan Tang (linyuan-tang@g.ecc.u-tokyo.ac.jp), Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo, Japan
Kyo Kageura (kyo@p.u-tokyo.ac.jp), Interfaculty Initiative in Information Studies, The University of Tokyo, Tokyo, Japan

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. © 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT

Thanks to recent developments in distributed representation learning and the large amount of published and digitized legal texts, computational linguistic analysis of legal language has become possible and efficient. However, most open language resources and shared tasks are in English. For languages with few open legal texts, such as Japanese, a word embedding model trained on domain-specific language usage raises concerns of lower accuracy and representativeness. Based on the observation that legal language shares a modest common vocabulary with general language, we examined the validity of using a pre-trained general word embedding model for processing legal texts through an intrinsic evaluation constructed on pairs of synonyms and related terms extracted from a legal term dictionary. We first investigated the hyperparameter settings of embedding models trained on legal texts, and then compared the performance of our domain-specific models with that of general models. The pre-trained Wikipedia model performed better than the domain-specific models at detecting semantic relations, and it also showed higher compatibility with legal texts than the general model trained on newspaper articles. Although researchers tend to stress the importance of domain-specific representation models, a general model can still be an alternative solution when language resources are scarce.

1 INTRODUCTION

Owing to the emergence of Word2Vec [7, 8] and the subsequent explosive improvements in distributed representation learning, the use of distributed representation models as features has become a paradigm in automated semantic analysis. In general, constructing such models calls for training on large-scale balanced corpora, and evaluating them calls for shared downstream tasks and robust evaluation measures.

Resources of general language usage are abundant in the major languages. However, when processing texts in specialised domains, the vocabularies of these domains can be very different from general language. Besides the so-called "technical terms" that appear in every specialised domain, there are also "sub-technical terms" that "activate a specialised meaning in the legal field, being frequently used as general words in everyday language" [5]. The high ratio of sub-technical terms in the legal English vocabulary differentiates it "from the lexicon of other LSP (Language for Specific Purposes) varieties" [6]. The existence of technical and sub-technical terms indicates that words in a specialised domain differ from general language not only in vocabulary but also in semantics, which can make the application of general word embedding models to specialised domains inefficient and unreasonable because the semantic spaces are inconsistent. Thus, to keep the texts being processed and the embedding models in the same semantic space, domain-specific models are preferable.

Unfortunately, in comparison with English, there is little open data of legal texts in Japanese: the documents are either not in a machine-readable format or not open to the public at all. Since [7] estimates that "both using more data and higher dimensional word vectors will improve the accuracy", less data will conversely cause lower accuracy. Nevertheless, the fact that legal language shares a high ratio of its vocabulary with general language opens the possibility of applying general embedding models.

Therefore, in this paper we examine whether general embedding models, specifically a Japanese word embedding model pre-trained on Japanese Wikipedia and a model trained on newspaper articles, can be used when processing legal texts. We start by constructing a similarity and relatedness task as an intrinsic evaluation of trained embedding models. Pairs of synonyms and related terms are extracted from a Japanese legal term dictionary. We train domain-specific embedding models on two legal text datasets with different hyperparameter settings and investigate the best configurations. The general embedding models and the domain-specific models are then compared on the intrinsic evaluation.

Although the performance of a model mostly depends on downstream tasks, we believe it is also important for researchers to be aware of the distributed representations inside embedding models when using them to achieve better scores on specific tasks and to solve real-world problems.
2 RELATED WORK

NLP tasks related to legal issues, including legal information retrieval, document classification and question answering, have been attracting increasing attention from both computational linguists and legal professionals.

To improve the performance of these tasks with the assistance of semantic analysis, two word embedding models have been trained specifically on legal texts. One is the pre-trained model built into the Python library LexNLP [1], which focuses on natural language processing and machine learning for legal and regulatory text; its pre-trained models are based on thousands of real documents and various judicial and regulatory proceedings. The other is Law2Vec (https://archive.org/details/Law2Vec), provided by LIST (http://www.luxli.lu/university-of-athens/). This model is "oriented to legal text trained on large corpora comprised of legislation from UK, EU, Canada, Australia, USA, and Japan among other legal documents." Although Japanese legal texts appear to have been used for obtaining semantic representations of words in the legal domain, the texts used were English translations and the models are for legal English.

COLIEE (the Competition on Legal Information Extraction/Entailment) [4] is the only competition on Japanese legal texts; it provides law articles in both Japanese and English as a knowledge resource. COLIEE 2017 focused on the extraction and entailment-identification aspects of legal information processing related to answering yes/no questions from Japanese legal bar exams. Carvalho et al. [2] and Nanda et al. [9] both tested the Google News pre-trained vectors (https://code.google.com/archive/p/word2vec) for information retrieval, and the former team found that the "pure common text embedding" resulted in poor performance, "most probably due to the absence of legal vocabulary and corresponding semantics."

The evaluation of word embeddings trained from different textual resources has been conducted in the biomedical domain. Roberts [12] revealed that combinations of corpora led to better performance. Wang et al. [13] concluded that word embeddings trained on the biomedical domain did not necessarily perform better than those trained on the general domain. While both agreed that the efficiency of a word embedding model is task-dependent, Gu et al. [3] argued that even smaller domain-specific corpora may be preferable to pre-trained word embeddings built on a general corpus when the diversity of the vocabulary is low.

In general, related work tends to indicate the importance of domain-specific distributed representation models for processing specialised texts.
3 DATA

Our dataset consists of three corpora: a dictionary of legal terms (hereinafter dictionary), the "fact of the crime" parts of judgements obtained from the Westlaw Japan judicial precedent corpus (https://www.westlawjapan.com/; referred to as judgements), and newspaper articles contained in the Mainichi Newspaper Corpus (referred to as newspaper). Basic statistics of the corpora are given in Table 1; detailed descriptions of each corpus follow.

Table 1: Basic statistics of our dataset.

  Corpus                      #Token    #Type
  dictionary                 781,027   20,328
  judgements                 790,665   17,423
  dictionary+judgements    1,571,692   30,915
  newspaper               22,928,051  242,630

Dictionary. The technical term dictionary adopted in this work is the Yuhikaku Legal Term Dictionary (4th edition). It consists of 13,812 entry words with definitions written and carefully edited by experts. We simply use "legal term" (or "term") to refer to the entry words recorded in the dictionary, without entering the sophisticated discussion of what the word means. In the dictionary, a synonym of a term t is given when t has no definition and is labeled with a See tag, while a related term of t is labeled with a See Also tag when the experts thought more information was needed (a sketch of how these tags can be exploited is given at the end of this section).

Judgements. We obtained 2,306 judgements passed on criminal cases in district courts nationwide from 2008 to 2017. Legal English is known as legalese because of its tedious and puzzling language usage, and legal Japanese shares these problems. Therefore, in order to conduct a moderate comparison with newspaper articles in terms of content and document length, we extracted the "fact of the crime" part from each judgement.

Newspaper. When a case happens, it is often reported as an article in the social section of the newspaper. Additionally, the language usage in a newspaper article can be considered general usage, or at least less specialised than the legalese used in legal texts. We obtained all the articles from a one-year (2015) corpus; 1,748 legal terms were observed in these articles.
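The See and See Also tags described above are what we later exploit to build synonym and related-term pairs (Section 4.1). The dictionary's internal format is not public, so the following is only a minimal sketch under the assumption of a simple tab-separated export with one headword–tag–target triple per line; the field layout and tag spellings are our own assumptions, not the actual Yuhikaku format.

```python
from typing import List, Tuple

def load_term_pairs(path: str) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split dictionary cross-references into synonym and related-term pairs.

    Assumes (hypothetically) that each line is "headword<TAB>tag<TAB>target",
    where the tag is "See" (synonym) or "SeeAlso" (related term).
    """
    synonyms, related = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, tag, target = line.rstrip("\n").split("\t")
            if tag == "See":
                synonyms.append((head, target))
            elif tag == "SeeAlso":
                related.append((head, target))
    return synonyms, related
```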
4 METHODS

Before processing, we applied the Japanese morphological analyzer ChaSen (version 0.996, neologd 102) to split the sentences into words and to remove symbols and numbers. All similarities between two vectors in this paper are cosine similarities.

The examination proceeded in two steps. First, we built a term-pair inventory for performance evaluation, separating term pairs into synonym pairs and related-term pairs, and trained domain-specific models with hyperparameter tuning against this inventory. Second, we focused on the term pairs shared by the general models and the domain-specific models, and examined the models' performance on both synonym detection and related-term detection.

4.1 Task Design

We extracted 1,440 pairs of synonyms and 6,641 pairs of related terms by exploiting the indicative tags provided in the dictionary. These pairs constitute the gold standards of the synonym detection task and the related-term detection task, which evaluate each model's ability to capture semantic relations between terms.

We evaluated the models by counting how many semantic relations each model captured correctly. Specifically, we first obtained the top n most similar words to a term t from the model, with n set to {1, 5, 10}. If the synonym or the related term was among these most similar words, we counted the trial as correct. Performance is reported as accuracy, the ratio of correctly predicted pairs to all synonym or related-term pairs.
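As a concrete illustration of this top-n procedure, here is a minimal sketch using gensim's KeyedVectors; the paper only states that gensim was used, so the exact API (gensim 4.x names such as key_to_index) is our assumption. `pairs` is a list of (term, target) tuples from the gold standard.

```python
from gensim.models import KeyedVectors

def detection_accuracy(kv: KeyedVectors, pairs, topn: int) -> float:
    """Accuracy of a model on synonym / related-term detection.

    A pair (term, target) counts as correct when target appears among
    the topn words most cosine-similar to term. Pairs with an
    out-of-vocabulary member count as incorrect, so the denominator
    stays the full gold standard, as in the paper.
    """
    hits = 0
    for term, target in pairs:
        if term not in kv.key_to_index or target not in kv.key_to_index:
            continue  # OOV pair: cannot be predicted, counts as a miss
        neighbours = {w for w, _ in kv.most_similar(term, topn=topn)}
        if target in neighbours:
            hits += 1
    return hits / len(pairs)

# e.g. detection_accuracy(wiki_kv, synonym_pairs, topn=10)
```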
4.2 Model Training

We used the pre-trained Wikipedia Entity Vectors (https://github.com/singletongue/WikiEntVec; built from a Wikipedia dump up to 2018-10-01) as our general word embedding model. It is a 300-dimension Skip-Gram Negative Sampling (SGNS) model. With the same training configuration, we trained another general model on newspaper articles for comparison among the general models. We then trained our domain-specific models on the dictionary and the judgements, separately and together. The sizes of the source data and vocabularies are given in Table 2.

Table 2: Vocabulary sizes of word embedding models. The sizes of the domain-specific models are presented in the order min.count = {2, 3, 5}; the min.count value of the general models was 3.

  Source                  #Vocabulary
  dictionary              8,803 / 7,202 / 5,697
  judgements              9,539 / 7,747 / 5,991
  dictionary+judgements   14,292 / 11,769 / 9,318
  Wikipedia               1,463,528
  newspaper               242,630

The performance of word embedding models can be improved by hyperparameter tuning. Since the effects of different configurations can be diverse, we investigated hyperparameter settings as in [10]. We used gensim [11] for model training. The examined parameters and values are shown in Table 3. Each model had five chances on each task.

Table 3: Hyperparameter tuning for domain-specific models.

  Parameter         Values
  dimension         50, 100, 200, 300, 400
  window size       2, 3, 5, 10, 15
  min.count         2, 3, 5
  negative sample   3, 5, 10, 15
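A minimal sketch of how one SGNS configuration from Table 3 can be trained with gensim is shown below, using the configuration that was eventually selected ({dimension = 300, window size = 10, min.count = 3, negative sample = 10}; see Section 5.1). The parameter names follow gensim 4.x (vector_size/epochs; older releases use size/iter), and the toy corpus merely stands in for the real ChaSen output.

```python
from gensim.models import Word2Vec

# Toy stand-in for the real corpus: one token list per sentence, as
# produced by the morphological-analysis step. Repeated so that every
# token clears min_count=3 and the sketch actually runs.
tokenized_sentences = [
    ["被告人", "は", "窃盗", "の", "罪", "に", "問わ", "れ", "た"],
    ["裁判所", "は", "判決", "を", "言い渡し", "た"],
] * 100

model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=300,  # dimension
    window=10,        # window size
    min_count=3,      # min.count
    negative=10,      # negative samples
    sg=1,             # skip-gram; with negative > 0 this is SGNS
    epochs=5,
)
model.wv.save("dictionary_sgns.kv")  # keep only the word vectors
```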
5 RESULTS

5.1 Model Tuning

The best accuracy scores of the models on the two tasks under different configurations are shown in Tables 4 and 5. The models trained on the judgements failed to detect both synonym and relatedness relations: their best accuracy was 0 (0.0%) for synonym pairs and 19 (0.3%) for related-term pairs. In both tasks, the additional legal texts (i.e., the judgements) did not improve the performance of our domain-specific models, which indicates that our legal text dataset is biased towards the dictionary data and that more data does not always lead to better performance.

Table 4: The best accuracy scores on synonym detection under different configurations (1,440 synonym pairs).

  Model                   top 1      top 5      top 10
  dictionary              3 (0.2%)   5 (0.3%)   6 (0.4%)
  dictionary+judgements   3 (0.2%)   4 (0.3%)   5 (0.3%)
  Wikipedia               15 (1.0%)  56 (3.9%)  79 (5.5%)
  newspaper               5 (0.3%)   18 (1.3%)  27 (1.8%)

Table 5: The best accuracy scores on related term detection under different configurations (6,641 related term pairs).

  Model                   top 1       top 5       top 10
  dictionary              83 (1.2%)   163 (2.5%)  206 (3.1%)
  dictionary+judgements   78 (1.2%)   159 (2.4%)  208 (3.1%)
  Wikipedia               142 (2.1%)  371 (5.6%)  472 (7.1%)
  newspaper               43 (0.6%)   107 (1.6%)  153 (2.3%)

The default training configuration of gensim is {dimension = 100, window size = 5, min.count = 5, negative sample = 5}, and the configuration selected after hyperparameter tuning for an English domain-specific model [10] was {dimension = 400, window size = 5, min.count = 5, negative sample = 5}. However, we found that a window size or negative sample lower than 10 led to worse performance in all circumstances, and, owing to the relatively tiny data size, a min.count larger than 3 also affected performance negatively.

The most suitable configuration for the models trained on the dictionary across the variation of top n was {dimension = 300, window size = 15, min.count = 3, negative sample = 10}. It is similar to the configuration of the Wikipedia model, which is {dimension = 300, window size = 10, min.count = 3, negative sample = 10}. We selected the same parameter values as the Wikipedia model as the training configuration of the domain-specific model with which the general models are compared in the next stage.

5.2 Intrinsic Evaluation

As shown in Tables 4 and 5, the Wikipedia model achieved higher performance at detecting the semantic relations of legal terms, even though those relations were obtained from the legal domain. This result may be due to the absence of low-frequency terms in the dictionary corpus. Therefore, we further conducted the two detection tasks on the pairs common to the domain-specific model, the Wikipedia model and the newspaper model. There were 18 common synonym pairs and 564 common related-term pairs. The results of this experiment are shown in Tables 6 and 7.

Table 6: Results of synonym detection (18 synonym pairs).

  Model       top 1      top 5      top 10
  dictionary  1 (5.6%)   1 (5.6%)   1 (5.6%)
  Wikipedia   3 (16.7%)  8 (44.4%)  8 (44.4%)
  newspaper   1 (5.6%)   2 (11.1%)  4 (22.2%)

Table 7: Results of related term detection (564 related term pairs).

  Model       top 1       top 5        top 10
  dictionary  44 (7.8%)   84 (14.9%)   108 (19.1%)
  Wikipedia   58 (10.3%)  135 (24.0%)  165 (29.3%)
  newspaper   18 (3.2%)   40 (7.1%)    49 (8.7%)

The Wikipedia model achieved the best accuracy scores among the three models, while the other general embedding model, the newspaper model, was the worst. The performance difference between the Wikipedia model and the newspaper model also confirms that the performance of general models is affected by the diversity of the general language resource. That the results on the common term pairs resemble the results on all term pairs indicates that the Wikipedia model is superior to the domain-specific dictionary model at capturing the intrinsic semantic relations of legal terms.
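The restriction to common pairs amounts to keeping only those gold pairs whose two terms are in-vocabulary for every model under comparison. A minimal sketch, again assuming gensim 4.x KeyedVectors:

```python
from gensim.models import KeyedVectors

def common_pairs(pairs, *models: KeyedVectors):
    """Keep only the pairs whose both members every model can represent."""
    return [
        (t, u) for t, u in pairs
        if all(t in kv.key_to_index and u in kv.key_to_index for kv in models)
    ]

# e.g. shared = common_pairs(related_pairs, dict_kv, wiki_kv, news_kv)
```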
6 CONCLUSION

Since the usefulness of an embedding model mostly depends on downstream tasks, we do not argue which embedding model is better or worse for legal NLP tasks. The purpose of this research is to investigate whether a general corpus can be used when training on the specific domain is not practicable. The word embedding model built on Wikipedia showed considerable performance on the intrinsic evaluation. The legal domain differs from other specialised domains in its ratio of words overlapping with general language, and this characteristic is helpful when there are not enough domain-specific language resources.

In this paper, we provided some evidence that domain-specific word embedding models do not always outperform general models, and that not all domain-specific texts are useful for establishing the semantic relations among technical terms. The use of general word embedding models, especially models trained on a large-scale balanced corpus, can therefore be considered an alternative way of processing domain-specific texts.

ACKNOWLEDGMENTS

The authors would like to thank YUHIKAKU Publishing Co., Ltd. for providing the legal dictionary dataset. We are also grateful to the reviewers for their valuable comments and suggestions.

REFERENCES

[1] Michael James Bommarito, Daniel Martin Katz, and Eric Detterman. 2018. LexNLP: Natural Language Processing and Information Extraction For Legal and Regulatory Texts. SSRN Electronic Journal (2018). https://doi.org/10.2139/ssrn.3192101
[2] Danilo S. Carvalho, Vu Tran, Khanh Van Tran, and Nguyen Le Minh. 2017. Improving Legal Information Retrieval by Distributional Composition with Term Order Probabilities. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 43–56. https://doi.org/10.29007/2xzw
[3] Yang Gu, Gondy Leroy, Sydney Pettygrove, Maureen Kelly Galindo, and Margaret Kurzius-Spencer. 2018. Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD). AMIA Annual Symposium Proceedings (2018), 508–517.
[4] Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 1–8. https://doi.org/10.29007/fm8f
[5] María José Marín and Camino Rea. 2014. Researching Legal Terminology: A Corpus-based Proposal for the Analysis of Sub-technical Legal Terms. ASp 66 (Nov. 2014), 61–82. https://doi.org/10.4000/asp.4572
[6] María José Marín Pérez. 2016. Measuring the Degree of Specialisation of Sub-technical Legal Terms through Corpus Comparison: A Domain-independent Method. Terminology 22, 1 (2016), 80–102. https://doi.org/10.1075/term.22.1.04mar
[7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:cs.CL/1301.3781v3
[8] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[9] Rohan Nanda, Adebayo Kolawole John, Luigi Di Caro, Guido Boella, and Livio Robaldo. 2017. Legal Information Retrieval Using Topic Clustering and Neural Networks. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 68–78. https://doi.org/10.29007/psgx
[10] Farhad Nooralahzadeh, Lilja Øvrelid, and Jan Tore Lønning. 2018. Evaluation of Domain-specific Word Embeddings using Knowledge Resources. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resources Association, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1228
[11] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[12] Kirk Roberts. 2016. Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP. In Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP). The COLING 2016 Organizing Committee, Osaka, Japan, 54–63. https://www.aclweb.org/anthology/W16-4208
[13] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A Comparison of Word Embeddings for the Biomedical Natural Language Processing. Journal of Biomedical Informatics 87 (Nov. 2018), 12–20. https://doi.org/10.1016/j.jbi.2018.09.008