An Examination of the Validity of General Word Embedding Models for Processing Japanese Legal Texts

Linyuan Tang (linyuan-tang@g.ecc.u-tokyo.ac.jp), Graduate School of Interdisciplinary Information Studies, The University of Tokyo, Tokyo, Japan
Kyo Kageura (kyo@p.u-tokyo.ac.jp), Interfaculty Initiative in Information Studies, The University of Tokyo, Tokyo, Japan

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada. © 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT

Thanks to recent developments in distributed representation learning and the large amount of published and digitized legal texts, computational linguistic analysis of legal language has become possible and efficient. However, most open language resources and shared tasks are in English. For languages with few open legal texts, such as Japanese, a word embedding model trained on domain-specific language usage raises concerns of lower accuracy and representativeness. Based on the observation that legal language shares a modest common vocabulary with general language, we examined the validity of using a pre-trained general word embedding model for processing legal texts through an intrinsic evaluation constructed on pairs of synonyms and related terms extracted from a legal term dictionary. We first investigated the hyperparameter settings of embedding models trained on legal texts, and then compared the performance of our domain-specific models with that of general models. The pre-trained Wikipedia model performed better than the domain-specific models at detecting semantic relations, and it also showed higher compatibility with legal texts than the general model trained on newspaper articles. Although researchers tend to stress the importance of domain-specific representation models, a general model can still be an alternative solution when language resources are scarce.

1 INTRODUCTION

Owing to the emergence of Word2Vec [7, 8] and the subsequent explosive improvements in distributed representation learning, the use of distributed representation models as features has become a paradigm in automated semantic analysis. In general, constructing such models calls for training on large-scale balanced corpora, and evaluating them calls for shared downstream tasks and robust evaluation measures.

Resources of general language usage are abundant in the major languages. However, when processing texts in specialised domains, the vocabularies of these domains can be very different from general language. Besides the so-called "technical terms" that appear in every specialised domain, there are also "sub-technical terms" that "activate a specialised meaning in the legal field, being frequently used as general words in everyday language" [5]. The high ratio of sub-technical terms in the legal English vocabulary differentiates it "from the lexicon of other LSP (Language for Specific Purposes) varieties" [6]. The existence of technical and sub-technical terms indicates that words in a specialised domain differ from general language not only in vocabulary but also in semantics, which can make the application of general word embedding models to specialised domains inefficient and unreasonable because the semantic spaces are inconsistent. Thus, to keep the texts being processed and the embedding models in the same semantic space, domain-specific models are preferable.

Unfortunately, in comparison with English, there is little open data of legal texts in Japanese: the documents are either not in a machine-readable format or not open to the public at all. Since [7] estimates that "both using more data and higher dimensional word vectors will improve the accuracy", less data will conversely cause lower accuracy. Nevertheless, the fact that legal language shares a high ratio of its vocabulary with general language opens the possibility of applying general embedding models.

Therefore, in this paper we examine whether general embedding models, specifically a Japanese word embedding model pre-trained on Japanese Wikipedia and a model trained on newspaper articles, can be used when processing legal texts. We start by constructing a similarity and relatedness task as an intrinsic evaluation of trained embedding models. Pairs of synonyms and related terms are extracted from a Japanese legal term dictionary. We train domain-specific embedding models on two legal text datasets with different hyperparameter settings and investigate the best configurations. The general embedding models and the domain-specific models are then compared on the intrinsic evaluation.

Although the performance of a model mostly depends on downstream tasks, we believe it is also important for researchers to be aware of the distributed representations inside embedding models when using them to achieve better scores on specific tasks and to solve real-world problems.
2 RELATED WORK

NLP tasks related to legal issues, including legal information retrieval, document classification and question answering, have been attracting increasing attention from both computational linguists and legal professionals.

To improve the performance of these tasks with the assistance of semantic analysis, two word embedding models have been trained specifically on legal texts. One is the pre-trained model built into the Python library LexNLP [1], which focuses on natural language processing and machine learning for legal and regulatory text; its pre-trained models are based on thousands of real documents and various judicial and regulatory proceedings. The other is Law2Vec (https://archive.org/details/Law2Vec), provided by LIST (http://www.luxli.lu/university-of-athens/). This model is "oriented to legal text trained on large corpora comprised of legislation from UK, EU, Canada, Australia, USA, and Japan among other legal documents." Although Japanese legal texts appear to have been used for obtaining semantic representations of words in the legal domain, the texts used were English translations and the models are for legal English.

COLIEE (the Competition on Legal Information Extraction/Entailment) [4] is the only competition on Japanese legal texts; it provides law articles in both Japanese and English as a knowledge resource. COLIEE 2017 focused on the extraction and entailment-identification aspects of legal information processing related to answering yes/no questions from Japanese legal bar exams. Carvalho et al. [2] and Nanda et al. [9] both tested the Google News pre-trained vectors (https://code.google.com/archive/p/word2vec) for information retrieval, and the former team found that the "pure common text embedding" resulted in poor performance, "most probably due to the absence of legal vocabulary and corresponding semantics."

The evaluation of word embeddings trained from different textual resources has been conducted in the biomedical domain. Roberts [12] revealed that combinations of corpora led to better performance. Wang et al. [13] concluded that word embeddings trained on the biomedical domain did not necessarily perform better than those trained on the general domain. While both agreed that the efficiency of a word embedding model is task-dependent, Gu et al. [3] argued that even smaller domain-specific corpora may be preferable to pre-trained word embeddings built on a general corpus when the diversity of the vocabulary is low.

In general, related work tends to indicate the importance of domain-specific distributed representation models for processing specialised texts.
3 DATA

Our dataset consists of three corpora: a dictionary of legal terms (hereinafter dictionary), the "fact of the crime" parts of judgements obtained from the Westlaw Japan judicial precedent corpus (https://www.westlawjapan.com/; referred to as judgements), and newspaper articles contained in the Mainichi Newspaper Corpus (referred to as newspaper). Basic statistics of the corpora are given in Table 1; detailed descriptions of each corpus follow.

Table 1: Basic statistics of our dataset.

  Corpus                      #Token    #Type
  dictionary                 781,027   20,328
  judgements                 790,665   17,423
  dictionary+judgements    1,571,692   30,915
  newspaper               22,928,051  242,630

Dictionary. The technical term dictionary adopted in this work is the Yuhikaku Legal Term Dictionary (4th edition). It consists of 13,812 entry words with definitions written and carefully edited by experts. We simply use "legal term" (or "term") to refer to the entry words recorded in the dictionary, without entering the sophisticated discussion of what the word means. In the dictionary, a synonym of a term t is given when t has no definition and is labeled with a See tag, while a related term of t is labeled with a See Also tag when the experts thought more information was needed (a sketch of how these tags can be exploited is given at the end of this section).

Judgements. We obtained 2,306 judgements passed on criminal cases in district courts nationwide from 2008 to 2017. Legal English is known as legalese because of its tedious and puzzling language usage, and legal Japanese shares these problems. Therefore, in order to conduct a moderate comparison with newspaper articles in terms of content and document length, we extracted the "fact of the crime" part from each judgement.

Newspaper. When a case happens, it is often reported as an article in the social section of the newspaper. Additionally, the language usage in a newspaper article can be considered general usage, or at least less specialised than the legalese used in legal texts. We obtained all the articles from a one-year (2015) corpus; 1,748 legal terms were observed in these articles.
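The See and See Also tags described above are what we later exploit to build synonym and related-term pairs (Section 4.1). The dictionary's internal format is not public, so the following is only a minimal sketch under the assumption of a simple tab-separated export with one headword–tag–target triple per line; the field layout and tag spellings are our own assumptions, not the actual Yuhikaku format.

```python
from typing import List, Tuple

def load_term_pairs(path: str) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Split dictionary cross-references into synonym and related-term pairs.

    Assumes (hypothetically) that each line is "headword<TAB>tag<TAB>target",
    where the tag is "See" (synonym) or "SeeAlso" (related term).
    """
    synonyms, related = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            head, tag, target = line.rstrip("\n").split("\t")
            if tag == "See":
                synonyms.append((head, target))
            elif tag == "SeeAlso":
                related.append((head, target))
    return synonyms, related
```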
4 METHODS

Before processing, we applied the Japanese morphological analyzer ChaSen (version 0.996, neologd 102) to split the sentences into words and to remove symbols and numbers. All similarities between two vectors in this paper are cosine similarities.

The examination proceeded in two steps. First, we built a term-pair inventory for performance evaluation, separating term pairs into synonym pairs and related-term pairs, and trained domain-specific models with hyperparameter tuning against this inventory. Second, we focused on the term pairs shared by the general models and the domain-specific models, and examined the models' performance on both synonym detection and related-term detection.

4.1 Task Design

We extracted 1,440 pairs of synonyms and 6,641 pairs of related terms by exploiting the indicative tags provided in the dictionary. These pairs constitute the gold standards of the synonym detection task and the related-term detection task, which evaluate each model's ability to capture semantic relations between terms.

We evaluated the models by counting how many semantic relations each model captured correctly. Specifically, we first obtained the top n most similar words to a term t from the model, with n set to {1, 5, 10}. If the synonym or the related term was among these most similar words, we counted the trial as correct. Performance is reported as accuracy, the ratio of correctly predicted pairs to all synonym or related-term pairs.
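As a concrete illustration of this top-n procedure, here is a minimal sketch using gensim's KeyedVectors; the paper only states that gensim was used, so the exact API (gensim 4.x names such as key_to_index) is our assumption. `pairs` is a list of (term, target) tuples from the gold standard.

```python
from gensim.models import KeyedVectors

def detection_accuracy(kv: KeyedVectors, pairs, topn: int) -> float:
    """Accuracy of a model on synonym / related-term detection.

    A pair (term, target) counts as correct when target appears among
    the topn words most cosine-similar to term. Pairs with an
    out-of-vocabulary member count as incorrect, so the denominator
    stays the full gold standard, as in the paper.
    """
    hits = 0
    for term, target in pairs:
        if term not in kv.key_to_index or target not in kv.key_to_index:
            continue  # OOV pair: cannot be predicted, counts as a miss
        neighbours = {w for w, _ in kv.most_similar(term, topn=topn)}
        if target in neighbours:
            hits += 1
    return hits / len(pairs)

# e.g. detection_accuracy(wiki_kv, synonym_pairs, topn=10)
```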
4.2 Model Training

We used the pre-trained Wikipedia Entity Vectors (https://github.com/singletongue/WikiEntVec; built from a Wikipedia dump up to 2018-10-01) as our general word embedding model. It is a 300-dimension Skip-Gram Negative Sampling (SGNS) model. With the same training configuration, we trained another general model on newspaper articles for comparison among the general models. We then trained our domain-specific models on the dictionary and the judgements, separately and together. The sizes of the source data and vocabularies are given in Table 2.

Table 2: Vocabulary sizes of word embedding models. The sizes of the domain-specific models are presented in the order min.count = {2, 3, 5}; the min.count value of the general models was 3.

  Source                  #Vocabulary
  dictionary              8,803 / 7,202 / 5,697
  judgements              9,539 / 7,747 / 5,991
  dictionary+judgements   14,292 / 11,769 / 9,318
  Wikipedia               1,463,528
  newspaper               242,630

The performance of word embedding models can be improved by hyperparameter tuning. Since the effects of different configurations can be diverse, we investigated hyperparameter settings as in [10]. We used gensim [11] for model training. The examined parameters and values are shown in Table 3. Each model had five chances on each task.

Table 3: Hyperparameter tuning for domain-specific models.

  Parameter         Values
  dimension         50, 100, 200, 300, 400
  window size       2, 3, 5, 10, 15
  min.count         2, 3, 5
  negative sample   3, 5, 10, 15
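A minimal sketch of how one SGNS configuration from Table 3 can be trained with gensim is shown below, using the configuration that was eventually selected ({dimension = 300, window size = 10, min.count = 3, negative sample = 10}; see Section 5.1). The parameter names follow gensim 4.x (vector_size/epochs; older releases use size/iter), and the toy corpus merely stands in for the real ChaSen output.

```python
from gensim.models import Word2Vec

# Toy stand-in for the real corpus: one token list per sentence, as
# produced by the morphological-analysis step. Repeated so that every
# token clears min_count=3 and the sketch actually runs.
tokenized_sentences = [
    ["被告人", "は", "窃盗", "の", "罪", "に", "問わ", "れ", "た"],
    ["裁判所", "は", "判決", "を", "言い渡し", "た"],
] * 100

model = Word2Vec(
    sentences=tokenized_sentences,
    vector_size=300,  # dimension
    window=10,        # window size
    min_count=3,      # min.count
    negative=10,      # negative samples
    sg=1,             # skip-gram; with negative > 0 this is SGNS
    epochs=5,
)
model.wv.save("dictionary_sgns.kv")  # keep only the word vectors
```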
5 RESULTS

5.1 Model Tuning

The best accuracy scores of the models on the two tasks under different configurations are shown in Tables 4 and 5. The models trained on the judgements failed to detect both synonym and relatedness relations: their best accuracy was 0 (0.0%) for synonym pairs and 19 (0.3%) for related-term pairs. In both tasks, the additional legal texts (i.e., the judgements) did not improve the performance of our domain-specific models, which indicates that our legal text dataset is biased towards the dictionary data and that more data does not always lead to better performance.

Table 4: The best accuracy scores on synonym detection under different configurations (1,440 synonym pairs).

  Model                   top 1      top 5      top 10
  dictionary              3 (0.2%)   5 (0.3%)   6 (0.4%)
  dictionary+judgements   3 (0.2%)   4 (0.3%)   5 (0.3%)
  Wikipedia               15 (1.0%)  56 (3.9%)  79 (5.5%)
  newspaper               5 (0.3%)   18 (1.3%)  27 (1.8%)

Table 5: The best accuracy scores on related term detection under different configurations (6,641 related term pairs).

  Model                   top 1       top 5       top 10
  dictionary              83 (1.2%)   163 (2.5%)  206 (3.1%)
  dictionary+judgements   78 (1.2%)   159 (2.4%)  208 (3.1%)
  Wikipedia               142 (2.1%)  371 (5.6%)  472 (7.1%)
  newspaper               43 (0.6%)   107 (1.6%)  153 (2.3%)

The default training configuration of gensim is {dimension = 100, window size = 5, min.count = 5, negative sample = 5}, and the configuration selected after hyperparameter tuning for an English domain-specific model [10] was {dimension = 400, window size = 5, min.count = 5, negative sample = 5}. However, we found that a window size or negative sample lower than 10 led to worse performance in all circumstances, and, owing to the relatively tiny data size, a min.count larger than 3 also affected performance negatively.

The most suitable configuration for the models trained on the dictionary across the variation of top n was {dimension = 300, window size = 15, min.count = 3, negative sample = 10}. It is similar to the configuration of the Wikipedia model, which is {dimension = 300, window size = 10, min.count = 3, negative sample = 10}. We selected the same parameter values as the Wikipedia model as the training configuration of the domain-specific model with which the general models are compared in the next stage.

5.2 Intrinsic Evaluation

As shown in Tables 4 and 5, the Wikipedia model achieved higher performance at detecting the semantic relations of legal terms, even though those relations were obtained from the legal domain. This result may be due to the absence of low-frequency terms in the dictionary corpus. Therefore, we further conducted the two detection tasks on the pairs common to the domain-specific model, the Wikipedia model and the newspaper model. There were 18 common synonym pairs and 564 common related-term pairs. The results of this experiment are shown in Tables 6 and 7.

Table 6: Results of synonym detection (18 synonym pairs).

  Model       top 1      top 5      top 10
  dictionary  1 (5.6%)   1 (5.6%)   1 (5.6%)
  Wikipedia   3 (16.7%)  8 (44.4%)  8 (44.4%)
  newspaper   1 (5.6%)   2 (11.1%)  4 (22.2%)

Table 7: Results of related term detection (564 related term pairs).

  Model       top 1       top 5        top 10
  dictionary  44 (7.8%)   84 (14.9%)   108 (19.1%)
  Wikipedia   58 (10.3%)  135 (24.0%)  165 (29.3%)
  newspaper   18 (3.2%)   40 (7.1%)    49 (8.7%)

The Wikipedia model achieved the best accuracy scores among the three models, while the other general embedding model, the newspaper model, was the worst. The performance difference between the Wikipedia model and the newspaper model also confirms that the performance of general models is affected by the diversity of the general language resource. That the results on the common term pairs resemble the results on all term pairs indicates that the Wikipedia model is superior to the domain-specific dictionary model at capturing the intrinsic semantic relations of legal terms.
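The restriction to common pairs amounts to keeping only those gold pairs whose two terms are in-vocabulary for every model under comparison. A minimal sketch, again assuming gensim 4.x KeyedVectors:

```python
from gensim.models import KeyedVectors

def common_pairs(pairs, *models: KeyedVectors):
    """Keep only the pairs whose both members every model can represent."""
    return [
        (t, u) for t, u in pairs
        if all(t in kv.key_to_index and u in kv.key_to_index for kv in models)
    ]

# e.g. shared = common_pairs(related_pairs, dict_kv, wiki_kv, news_kv)
```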
6 CONCLUSION

Since the usefulness of an embedding model mostly depends on downstream tasks, we do not argue which embedding model is better or worse for legal NLP tasks. The purpose of this research is to investigate whether a general corpus can be used when training on the specific domain is not practicable. The word embedding model built on Wikipedia showed considerable performance on the intrinsic evaluation. The legal domain differs from other specialised domains in its ratio of words overlapping with general language, and this characteristic is helpful when there are not enough domain-specific language resources.

In this paper, we provided some evidence that domain-specific word embedding models do not always outperform general models, and that not all domain-specific texts are useful for establishing the semantic relations among technical terms. The use of general word embedding models, especially models trained on a large-scale balanced corpus, can therefore be considered an alternative way of processing domain-specific texts.

ACKNOWLEDGMENTS

The authors would like to thank YUHIKAKU Publishing Co., Ltd. for providing the legal dictionary dataset. We are also grateful to the reviewers for their valuable comments and suggestions.

REFERENCES

[1] Michael James Bommarito, Daniel Martin Katz, and Eric Detterman. 2018. LexNLP: Natural Language Processing and Information Extraction For Legal and Regulatory Texts. SSRN Electronic Journal (2018). https://doi.org/10.2139/ssrn.3192101
[2] Danilo S. Carvalho, Vu Tran, Khanh Van Tran, and Nguyen Le Minh. 2017. Improving Legal Information Retrieval by Distributional Composition with Term Order Probabilities. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 43–56. https://doi.org/10.29007/2xzw
[3] Yang Gu, Gondy Leroy, Sydney Pettygrove, Maureen Kelly Galindo, and Margaret Kurzius-Spencer. 2018. Optimizing Corpus Creation for Training Word Embedding in Low Resource Domains: A Case Study in Autism Spectrum Disorder (ASD). AMIA Annual Symposium Proceedings (2018), 508–517.
[4] Yoshinobu Kano, Mi-Young Kim, Randy Goebel, and Ken Satoh. 2017. Overview of COLIEE 2017. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 1–8. https://doi.org/10.29007/fm8f
[5] María José Marín and Camino Rea. 2014. Researching Legal Terminology: A Corpus-based Proposal for the Analysis of Sub-technical Legal Terms. ASp 66 (Nov. 2014), 61–82. https://doi.org/10.4000/asp.4572
[6] María José Marín Pérez. 2016. Measuring the Degree of Specialisation of Sub-technical Legal Terms through Corpus Comparison: A Domain-independent Method. Terminology 22, 1 (2016), 80–102. https://doi.org/10.1075/term.22.1.04mar
[7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv:cs.CL/1301.3781v3
[8] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[9] Rohan Nanda, Adebayo Kolawole John, Luigi Di Caro, Guido Boella, and Livio Robaldo. 2017. Legal Information Retrieval Using Topic Clustering and Neural Networks. In COLIEE 2017: 4th Competition on Legal Information Extraction and Entailment (EPiC Series in Computing), Ken Satoh, Mi-Young Kim, Yoshinobu Kano, Randy Goebel, and Tiago Oliveira (Eds.), Vol. 47. EasyChair, 68–78. https://doi.org/10.29007/psgx
[10] Farhad Nooralahzadeh, Lilja Øvrelid, and Jan Tore Lønning. 2018. Evaluation of Domain-specific Word Embeddings using Knowledge Resources. In Proceedings of the 11th Language Resources and Evaluation Conference. European Language Resources Association, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1228
[11] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[12] Kirk Roberts. 2016. Assessing the Corpus Size vs. Similarity Trade-off for Word Embeddings in Clinical NLP. In Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP). The COLING 2016 Organizing Committee, Osaka, Japan, 54–63. https://www.aclweb.org/anthology/W16-4208
[13] Yanshan Wang, Sijia Liu, Naveed Afzal, Majid Rastegar-Mojarad, Liwei Wang, Feichen Shen, Paul Kingsbury, and Hongfang Liu. 2018. A Comparison of Word Embeddings for the Biomedical Natural Language Processing. Journal of Biomedical Informatics 87 (Nov. 2018), 12–20. https://doi.org/10.1016/j.jbi.2018.09.008