Extending Thesauri Using Word Embeddings and the Intersection Method

Jörg Landthaler, Bernhard Waltl, Dominik Huth, Daniel Braun and Florian Matthes
Software Engineering for Business Information Systems, Department for Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany
joerg.landthaler@tum.de, b.waltl@tum.de, dominik.huth@tum.de, daniel.braun@tum.de, matthes@in.tum.de

Christoph Stocker and Thomas Geiger
Department for Portals and Collaboration (EM45), DATEV eG, Fürther Straße 111, 90329 Nürnberg, Germany
christoph.stocker@datev.de, thomas.geiger@datev.de

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT
In many legal domains, the amount of available and relevant literature is continuously growing. Legal content providers face the challenge to provide their customers relevant and comprehensive content for search queries on large corpora. However, documents written in natural language contain many synonyms and semantically related concepts. Legal content providers usually maintain thesauri to discover more relevant documents in their search engines. Maintaining a high-quality thesaurus is an expensive, difficult and manual task. The word embeddings technology recently gained a lot of attention for building thesauri from large corpora. We report our experiences on the feasibility of extending thesauri based on a large corpus of German tax law with a focus on synonym relations. Using a simple yet powerful new approach, called the intersection method, we can significantly improve and facilitate the extension of thesauri.

Keywords
thesaurus, synsets, word embeddings, word2vec, parameter study, intersection method, tax law

1. INTRODUCTION
Legal content providers offer their customers access to large amounts of different text documents. Clients expect a search engine that serves all relevant search results in an ordered manner with the most relevant results at the top. The expectation of users encompasses that all relevant documents are returned, which remains a major task in information retrieval and receives much attention, also in the legal domain, cf. Qiang and Conrad [14] or Grabmair et al. [7].

One possibility is the employment of thesauri to capture the ambiguous nature of natural language. Thesauri capture binary relations such as synonyms or antonyms, and some thesauri additionally cover hierarchical relationships. Large general-purpose thesauri have been built, for example WordNet (Fellbaum, 1998). Thesauri can be used for search query expansion to increase the recall of information retrieval systems and particularly to include documents that use synonymous words. Maintaining a thesaurus is expensive and error-prone, especially for large thesauri, see for example Dirschl [3]. In the area of computational linguistics, the automated creation of thesauri has been investigated since the 1950s, cf. Section 2. The distributional hypothesis claims that words that share contexts likely have a more similar meaning (as perceived by humans) than others. Since 2013 there has been an increasing interest in a technology called word embeddings that combines machine learning (ML) technologies with the distributional hypothesis. In contrast to distributional thesauri calculated based on statistical evaluations, the relatedness of words is calculated in a softer, iterative fashion and is easy to access using the cosine similarity measure.

In this paper we investigate the applicability of the word embeddings technology, in particular the word2vec implementation [16], to support humans in extending an existing thesaurus. While the overall goal is to find new relevant synonym relations that can be suggested to humans to consider for inclusion in the existing thesaurus,
the one focus of this paper is how word embeddings can be trained such that the quality of the word embeddings is good. The basic assumption is that high-quality word embeddings will lead to better suggestions for synonym relations that are not present in the current thesaurus. Related use cases are the creation of thesauri from scratch or the automated merging with third-party thesauri. Moreover, an unsolved problem is to determine only relevant synonyms given a word, i.e. to build sensible synonym sets (synsets). Most approaches need to resort to a fixed amount of relevant words or to rely on the identification of a suitable threshold of relatedness. We investigate a straightforward approach to identify semantically closed synsets without resorting to unreliable thresholds for a large corpus of German tax law and report our experiences. Using a given thesaurus that is manually maintained specifically for this corpus, we conduct parameter studies for the different parameters of word2vec. We propose and evaluate a novel approach to intersect result lists of a relatedness ranking of all words in the vocabulary of the corpus. Multiple word2vec word embeddings models are calculated with different parameters. For a given word (target word), we calculate the relatedness ranking of all words in the corpus and intersect the lists of the first top results among the word embeddings models calculated with different word2vec parameters. We can report promising results of our evaluation of the intersection method with the given corpus and the corresponding manually maintained thesaurus.

The remainder of this work is organized as follows: In Section 2 we give a brief summary of automatic thesauri generation and related work. In Section 3 we give an overview of the corpus and the corresponding manually maintained thesaurus used for all our experiments. The word embeddings technology is introduced in Section 4. We study the different word2vec parameters in Section 5 and present our intersection list method in Section 6. We evaluate the novel intersection method in Section 7 and discuss limitations in Section 8. Finally, Section 9 summarizes our findings and discusses future work.

2. RELATED WORK
The manual creation of thesauri is a very labor-intensive process. There have been various attempts to automate the process. A popular approach emerged from the distributional hypothesis formulated by Harris in 1954 [9]. The distributional hypothesis claims that words with similar or related meanings tend to occur in similar contexts. The hypothesis is supported by many studies [25, 15, 10, 22].

In 1964, Sparck Jones [27] used the distributional hypothesis to automatically create thesauri using count-based methods. Many followed this approach, for example Grefenstette [8]. Thesauri are useful for several natural language processing problems and much effort has been put into improving distributional thesauri. Rychlý and Kilgarriff [26] developed a system called SketchEngine that efficiently generates distributional thesauri from large datasets. They pre-process the dataset and remove word pairs that have nothing in common before the actual calculation is performed. Hence, their algorithm can process a dataset with 2 billion words in less than 2 hours (compared to 300 days without the removal). In their project JoBimText, Riedl and Biemann [23] use a parallelized approach based on MapReduce and a Pointwise Mutual Information (PMI) measure to improve calculation speed as well as the quality of the generated thesaurus.

Word embeddings can be seen as an evolution of distributional statistics enhanced with machine learning approaches. Traditional distributional thesauri are calculated based on co-occurrence counts, while word embeddings leverage sub-sampling methods that are heavily used in the machine learning domain. Word embeddings provide easy access to word relatedness via the cosine similarity measure. Kiela et al. [11] proposed that during the training phase, word embeddings can be pushed in a particular direction and optimized for detecting relatedness. On the TOEFL synonym task, Freitag et al. [4] report considerably better results than for non-specialized embeddings. Thesauri often not only contain synonyms, but also antonyms. Ono et al. [19] presented an approach to detect antonyms, using word embeddings and distributional information. Nguyen et al. [18] improve the discrimination of antonyms and synonyms by integrating lexical contrast into their vector representation. In 2015, Rothe and Schütze [24] presented AutoExtend, an extension of word embeddings optimized to train embedding vectors for synonym sets (one vector per synset) and their composing lexemes. To the best of our knowledge and in contrast to our intersection method, all approaches use fixed-length result sets or fixed thresholds to build synsets.

Recently, word embeddings have been used for query expansion in information retrieval directly, i.e. without the creation of knowledge bases. Examples for such query expansion using word embeddings are Ganguly et al., Zamani et al. and Amer et al. [5, 28, 20]. Query expansion using word embeddings specialized for the legal domain has recently been proposed by Adebayo et al. [1].

3. DATASET & PRE-PROCESSING
We conduct our parameter studies and the evaluation of our intersection method for extending thesauri on a legal text corpus provided by our industry partner DATEV eG. The corpus comprises different document types on the topic of German tax law, cf. Figure 1. The corpus of 150 million pre-processed tokens yields a vocabulary of circa 180.000 entries. Our industry partner manually maintains a high-quality thesaurus specifically for this corpus including approximately 12.000 synonym sets with around 36.000 terms.

Figure 1: The used text corpus comprises different document types on the topic of German tax law with a total amount of approximately 130.000 documents. The corpus comprises roughly 150 million tokens yielding a vocabulary size of circa half a million words (not pre-processed).

Table 1: The occurrence frequency of a token in the corpus has a strong impact on the quality of the resulting embeddings for individual words. Therefore, we choose different evaluation selections of synonym sets from the given thesaurus - which is specifically maintained for this corpus - based on the minimal individual term frequency of the tokens; N is defined as the minimum occurrence frequency of each individual token in the corpus to be included in a synset in an evaluation thesaurus.

N     Synsets  Terms  Terms/Group  Relations
250   260      587    2.26         874
625   125      289    2.31         464
1000  84       195    2.32         312

We pre-process both the corpus and the thesaurus. The corpus is pre-processed such that we include only tokens with more than four characters, remove all special characters and punctuation and lowercase all tokens. A single line with all cleaned and white-space-separated tokens is entered into the word2vec algorithm. For the thesaurus, we additionally restrict ourselves to terms that consist of a single token. It is well known that the occurrence frequency of words is crucial to the quality of the resulting word embeddings.
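The pre-processing described above can be summarized in a short sketch. This is a minimal illustration rather than the pipeline actually used for the experiments: the file layout, file names and the exact character filtering (here letters, German umlauts and hyphens) are assumptions.

```python
import re
from collections import Counter
from pathlib import Path

def clean_tokens(text):
    """Lowercase, drop punctuation/special characters, keep only tokens longer than four characters."""
    text = re.sub(r"[^a-zäöüß\-\s]", " ", text.lower())   # character whitelist is an assumption
    return [tok for tok in text.split() if len(tok) > 4]

def build_corpus_file(input_dir, output_file):
    """Concatenate all cleaned, white-space-separated tokens into a single line, as fed to word2vec."""
    tokens = []
    for path in sorted(Path(input_dir).glob("*.txt")):     # hypothetical plain-text export of the corpus
        tokens.extend(clean_tokens(path.read_text(encoding="utf-8")))
    Path(output_file).write_text(" ".join(tokens), encoding="utf-8")
    return Counter(tokens)                                  # token frequencies, reused below

def evaluation_synsets(synsets, frequencies, n_min):
    """Restrict the thesaurus to single-token terms occurring at least n_min times (cf. Table 1)."""
    selected = []
    for synset in synsets:
        terms = [t for t in synset if " " not in t and frequencies.get(t, 0) >= n_min]
        if len(terms) >= 2:                                 # keep only synsets with at least one relation
            selected.append(terms)
    return selected
```

Calling evaluation_synsets with n_min set to 250, 625 and 1000 would yield the three evaluation thesauri of Table 1.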
We extracted three different evaluation sets where all words in the evaluation thesaurus occur at least N = {250, 625, 1000} times in the corpus, see Table 1.

All experiments have been carried out on an Intel Core i5-2500 (4x2333MHz) machine with 8 GB DDR3-1333 RAM running Ubuntu 14.04, Python 2.7.6, Numpy 1.13.1, scipy 0.16.1 and word2vec 0.1c (only versions ≥ 0.1c support the iterations parameter).

4. WORD EMBEDDINGS
Word embeddings are a family of algorithms producing dense vectors that represent words in the vocabulary of a corpus. The word embeddings can be trained using the Continuous Bag-of-words (CBOW) or Skip-gram training models depicted in Figure 3. Word embeddings combine the distributional hypothesis with artificial neural networks [13]. Due to new efficient methods of calculating word embeddings introduced by Mikolov et al. [16], word embeddings for several gigabytes of text data can be calculated within hours. While word embeddings are still considered a bag-of-words approach, they do encode the general context of words in dense vectors. Mathematical operations, for example vector addition, can be carried out on the vectors while preserving their inherent semantic characteristics. Mikolov et al. [17] show that word embeddings trained on fictional English literature capture semantic relationships among words. We illustrate such semantic relationships encoded in word embeddings in Figure 2. We noticed that relevant characteristics are recognizable even for word embeddings trained on comparably very small training corpora [12], at least regarding text similarity tasks. Hence, we assume that our corpus with 150 million tokens is large enough to produce word embeddings with sufficient quality.

Figure 2: Illustration of the characteristics of word embeddings in two-dimensional space [12]. The angle α between the two vectors for the words freedom and obligation is larger than the angle β, which reflects that the two words duty and obligation are semantically more related than duty and freedom or obligation and freedom.

Figure 3: Continuous Bag-of-words (CBOW, CB) and Skip-gram (SG) training model illustrations adapted from [16].

The word2vec algorithm implements both training models. The basic idea behind the two training models is that either a word is used to predict its context or, the other way around, the context is used to predict the current word. This task is iterated word by word over the corpus. The prediction can be thought of as a classifier from machine learning. The vectors are collected from the weights of the artificial neural network that serves as a classifier. Hence, large amounts of text can be used in an unsupervised fashion to train word embeddings.

Next, we give a short summary of a selection of the most important implementations to calculate word embeddings:

• word2vec: The original C++ implementation of word2vec by Mikolov et al. [16], [17] is very fast. It provides a multi-threaded implementation, but it does not support check-pointing, i.e. resuming computations after stopping (https://github.com/kzhai/word2vec, word2vec version 0.1c for MAC OS X, accessed on 22/January/2017).

• gensim: The gensim implementation of word2vec provides a Python interface to calculate word embeddings and supports check-pointing (https://github.com/nicholas-leonard/word2vec, accessed on 22/January/2017).

• Apache Spark: Apache Spark includes a Java/Scala implementation of word2vec that can be run in a Hadoop cluster (https://spark.apache.org/, accessed on 22/January/2017).

• deeplearning4j: This Java implementation is embedded in the general deeplearning4j framework for machine learning and is similar to gensim, but implemented for usage with Java and Scala (https://deeplearning4j.org/word2vec.html, accessed on 22/January/2017).

• GloVe: GloVe vectors, presented by Pennington et al. [21], are a count-based algorithm implemented in C. GloVe vectors are similar to the word2vec embeddings, but optimize a different objective function (http://nlp.stanford.edu/projects/glove/, accessed on 22/January/2017).

• Torch: Torch is, similar to the well-known Tensorflow framework, a deep learning and scientific computing framework that provides a Lua interface to a word2vec implementation (https://github.com/yoonkim/word2vec_torch, accessed on 22/January/2017).

• Tensorflow: Tensorflow is a deep learning framework provided by Google and offers (among others) a Python interface to its own word2vec implementation (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/word2vec, accessed on 22/January/2017).

Figure 4: Illustration of our overall approach: word2vec parameters need to be chosen. Our intersection method can be applied after the (repeated) calculation of (fixed length) synsets.
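To illustrate the two training models and the cosine-similarity based relatedness ranking, the following sketch trains a CBOW and a Skip-gram model with the gensim implementation listed above. The parameter names follow recent gensim releases and the corpus file name and example target word are assumptions; the experiments in this paper were run with the original word2vec 0.1c tool.

```python
from pathlib import Path
from gensim.models import Word2Vec

# Pre-processed corpus (cf. Section 3), fed to gensim as fixed-size chunks of the single token line.
tokens = Path("corpus_preprocessed.txt").read_text(encoding="utf-8").split()
sentences = [tokens[i:i + 1000] for i in range(0, len(tokens), 1000)]

common = dict(
    vector_size=300,   # dimensionality of the vectors ("size" in the word2vec C tool)
    window=5,          # context window around the current word
    min_count=0,       # keep a vector for every token that occurs in the corpus
    negative=5,        # number of negative samples
    epochs=20,         # iterations over the corpus ("iter" in older versions)
    workers=4,
)

cbow = Word2Vec(sentences, sg=0, **common)       # CBOW: the context predicts the current word
skipgram = Word2Vec(sentences, sg=1, **common)   # Skip-gram: the current word predicts its context

# Relatedness ranking via cosine similarity for an example target word from our corpus:
print(cbow.wv.most_similar("abwrackprämie", topn=10))
```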
5. WORD2VEC PARAMETERS
The word2vec implementation has a large number of parameters. For the most relevant parameters, we conducted a parameter study using a manually maintained thesaurus as the ground truth for an evaluation of the quality of the resulting word embeddings. While a thesaurus by nature cannot be perfectly sharp, we assume that relations identified by humans are sufficiently reliable and that, by using a large number of such relations, we can identify general trends. The following list gives an overview of the most important parameters:

• Size (Dimensionality): The size of the resulting vectors is chosen manually. From an information entropy point of view this value needs to be large enough so that all relevant information can be encoded in the vectors. However, the larger the vector size is chosen, the more computationally expensive training and subsequent calculations become.

• Window Size: The window size is a training parameter that defines the size of the context window around each word during the training phase. Arguing from a statistical linguistics point of view, large (word-)distances between two words (for example in two consecutive sentences) usually lead to less influence of the words on each other [6].

• Iterations (I): The iterations parameter defines the number of iterations over the full corpus and can be thought of as an artificial enlargement of the corpus. The choice for this parameter heavily depends on the corpus size. A larger number of iterations is particularly useful for small corpora.

• Minimum Count: The occurrence frequency of individual tokens has a strong impact on the quality of the resulting word embeddings. Using the minimum count parameter, one can control that words occur sufficiently often in the corpus. The downside of this parameter is that words in the vocabulary that do not occur often enough in the corpus will not have a vector.

• Alpha: The initial learning rate is a parameter that is derived from artificial neural network training and is not investigated, because we assume that it is very specific to a concrete dataset.

• Negative: The number of negative examples presented during the training. Consult [6] for a comprehensive explanation.

• CBOW or Skip-gram: The two possible training models that can be used to train word embeddings with word2vec. Either the current word is used to predict the context of the current word or, vice versa, the context is used to predict the current word, cf. Figure 3. In our experiments the CBOW model produces high-quality word embeddings in less training time.

We assume that the vector size, window size, negative and iterations parameters are the most important parameters for the creation or extension of a thesaurus given a corpus. We set the minimum count parameter to zero, because we want a vector for each word present in the corpus.

The manually maintained thesaurus for our corpus contains groups of synonymously used words. A simple example for such a group of synonyms is (lawsuit, case, dispute). We are interested in how the vector size, window size, iterations and negative parameters affect similarity score lists calculated based on word embeddings trained by word2vec. We introduce the RP-Score measure that measures the average positions of synonyms in the ranking lists of the terms. The RP-Score provides a measure to compare the relatedness of two words obtained from humans and obtained by our word2vec approach. In contrast to the mean reciprocal rank (MRR), the RP-Score is not normalized to 1 and provides a more intuitive understanding of the quality of a word embeddings model.

Figure 5: The cosine similarity distance measure is symmetric between (the word embeddings vectors of) two words. The ranking positions of two words can be asymmetric, because other words can be closer to one of the two words, but more distant to the other word, cf. Table 2.

Table 2: For our parameter study we use a measure that we call RP-Score. Using the cosine similarity measure, we compute an ordered list of the relatedness of all words in the vocabulary to one selected word from the vocabulary (target word). For three target words, the five most related words are listed. The word dispute has the ranking position (RP) 4 for the target word lawsuit. The RP-Score for this example synset is (1+4+2+5+3+4)/6 ≈ 3.16.

RP  lawsuit     case        dispute
1   case        trial       controversial
2   law-suit    lawsuit     trial
3   litigation  litigation  lawsuit
4   dispute     law-suit    case
5   trial       dispute     litigation

The RP-Score measure is calculated as follows: For all target words in a synset we calculate a sorted list of all words in the vocabulary using the cosine similarity measure. The ranking list is ordered such that the most related words are at the top of the list. We determine the position for each combination of two words in a synset and accumulate all ranking positions. We perform this calculation on all synsets, divide the value per synset by the total number of relations among the words in that synset, and aggregate over all synsets. RP_{w_1}(w_2) is defined as the position of w_2 in the result list of w_1:

RPScore := \sum_{s \in \text{synsets}} \frac{\sum_{w_1, w_2 \in s,\; w_1 \neq w_2} RP_{w_1}(w_2)}{|s| \, (|s| - 1)}

Note that the RP-Scores are not a sharp measure in our evaluation. A thesaurus maintained by humans can lack (and contain) similarity relations that are included (or ignored) by an unsupervised word embeddings calculation. For example, spelling mistakes are often not contained in a thesaurus but are detected by our overall approach. Wrongly typed words can be good candidates for inclusion in a thesaurus, because documents like judgments cannot be corrected. Nevertheless, we are able to observe reasonable results in our parameter studies, due to the averaging of the RP-Score over a larger number of evaluation synsets.

For all parameters investigated, we applied both the CBOW and the Skip-gram model. For all experiments (unless otherwise stated) we used a vector size of 300 and ran 20 iterations. For the window size parameter study, the window parameter runs from 1 to 20. For the vector size parameter study, we increase the vector size from 100 to 700 in steps of 100, plus one run with vector size 50. For the number of iterations, we run 1 to 20 iterations incrementally and only selected samples up to 100 iterations, cf. Figures 6 and 7. For the negative (sampling size) parameter we consecutively vary the value between 1 and 10. For each parameter study, we calculate the RP-Scores using the three different evaluation thesauri described in Section 3 for each computed word embeddings model.

Figure 6: Parameter studies of different relevant word2vec parameters: window size, vector size, negative samples and number of iterations. For all parameters the computing costs increase with larger parameter values (cf. Figure 8) and a trade-off between computing costs and quality is inevitable. We measure the impact of the different word2vec parameters on the quality of the resulting word embeddings using the RP-Score with a thesaurus that is manually and specifically maintained for the given text corpus. Good parameter choices have a reasonably small RP-Score. For example, the default window size of 5 from the word2vec implementation is a good choice already. Note that the CBOW training model results in better RP-Scores while having smaller runtime values for equal parameter choices. Best viewed in color.
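A direct, if slow, implementation of the RP-Score defined in this section could look as follows. It is a sketch under the assumption that a trained gensim model (cf. the sketch in Section 4) is available and that synsets are given as lists of single tokens.

```python
def ranking_positions(model, target, others):
    """1-based positions of `others` in the cosine-similarity ranking of all words for `target`."""
    ranking = model.wv.most_similar(target, topn=len(model.wv))   # full, ordered result list
    index = {word: pos for pos, (word, _score) in enumerate(ranking, start=1)}
    return [index.get(other, len(model.wv)) for other in others]  # fall back to worst rank if missing

def rp_score(model, synsets):
    """RP-Score: per synset, sum all pairwise ranking positions, divide by |s| * (|s| - 1),
    and aggregate over all synsets."""
    total = 0.0
    for synset in synsets:
        terms = [t for t in synset if t in model.wv]
        if len(terms) < 2:
            continue
        pair_sum = 0
        for w1 in terms:
            pair_sum += sum(ranking_positions(model, w1, [w2 for w2 in terms if w2 != w1]))
        total += pair_sum / (len(terms) * (len(terms) - 1))
    return total

# The example synset (lawsuit, case, dispute) from Table 2 contributes (1+4+2+5+3+4) / 6 ≈ 3.16.
```

A production implementation would cache the full ranking per target word, since computing it for every pair dominates the runtime.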
Figure 7: We evaluated the word2vec iterations parameter for larger values: Reasonable quality of the word embeddings can - for this corpus - be obtained with I=20, but slight improvements, especially for individual evaluation thesaurus and training parameter choices, can be achieved with larger iteration parameter choices. Note that for a larger number of iterations we conducted fewer runs, hence the perceived relative smoothness for I>20 is not representative. Best viewed in color.

Figure 8: The computational costs in terms of runtime for the different parameter selections shown in Figure 6 increase linearly for all investigated parameter values (for the linearly increased parameter values from the parameter study). The calculation of reasonably good word embeddings models takes around one hour for the given dataset on the machine described in Section 3. Note that not all word embeddings models have been calculated in a sequential order of time. Thus, we can exclude cache effects.

6. INTERSECTION METHOD
Finding good parameters for the word2vec algorithm is an important step towards the creation or extension of thesauri. However, several challenges remain. A ranking with respect to the similarity score of all words in the vocabulary is not enough. One major challenge is to decide which words should be included in a synset and which should not. One possibility is to include a fixed number of related words according to the ranking list, for example the first ten. Another possibility is to define certain thresholds for the similarity score between two words. However, the similarity scores differ substantially among the result lists for different target words. For example, the first related result in a ranking list for the word law might have a similarity score of 0.7, while the first result for the word doctor could have a similarity score of 0.4, although both are considered true synonyms by humans.

During our experiments with the parameters of word2vec, we recognized that the result lists differ substantially from one word2vec parameter selection to another. This led us to the idea of calculating intersections of the result lists for target words. This approach has two advantages at once. First, we do not need to define a threshold or a fixed number of words to form a synset for a given word. Moreover, we experience that bad results are filtered out because their positions in the ranking lists vary a lot. Table 3 illustrates the intersection approach. From a different point of view, for a specific word, words that are always at the top of different ranking lists form a fix-point set that is stable and, as we will show in Section 7, returns higher quality results. We also found that the resulting intersection lists have different sizes, which reflects that the approach identifies sensible synset sizes. For example, specific words like scrapping bonus result in few words in the intersection lists, while words like doctor yield a large number of synonyms in the intersection lists. This behavior reflects the human perception of good synsets, too.

Table 3: Artificial example to illustrate our intersection method. We calculate the k=5 most related words from the word embeddings models for the word car. The left column shows the most related words to the target word car for the word embeddings model calculated with 20 iterations. The center column shows the most related words to the word car for the word embeddings model calculated with 19 iterations. The right column displays the intersection of the left and center columns. Irrelevant words (for example animal or chair) are dropped from the final result list by the intersection, because they tend to change their positions heavily among result lists from word embeddings models calculated with different parameters.

Iter.=20    Iter.=19    Intersection
car         car         car
auto        automobile  auto
automobile  chair       automobile
animal      sheep
doctor      egg
tree        auto

The intersection of result lists that span the full vocabulary would again result in lists of the size of the vocabulary. Hence, we use a fixed number of the first k entries of a result list as an intermediate result. This additional parameter serves as an upper bound for the size of the returned synsets. Due to the strong variation among result lists obtained from different word embeddings models, this upper limit is not reached most of the time.

7. EVALUATION
Our evaluation is an attempt to quantify the perception of humans that synsets obtained by the intersection method have higher quality than synsets obtained using thresholds. We compare precision/recall values calculated per synset and accumulate the individual results. We use precision (P), recall (R) and F1-Score (F1) defined as follows:

P := TP / (TP + FP)    R := TP / (TP + FN)    F1 := 2 * (P * R) / (P + R)

True positives (TP) are present in the evaluation thesaurus and in the result of our method. False positives (FP) are present in the result of our method, but not in the synset of the evaluation thesaurus. False negatives (FN) are present in the synset of the evaluation thesaurus, but not in the result lists. First, we use k=30 as the intermediate result list length.
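The intersection method of Section 6 and the per-target-word accumulation of TP, FP and FN described above can be sketched as follows. The sketch assumes a list of trained gensim models and an evaluation thesaurus given as a mapping from a target word to its set of gold synonyms; k bounds the intermediate result lists.

```python
def top_k(model, target, k=30):
    """The k most related words to `target` by cosine similarity (the intermediate result list)."""
    return [word for word, _score in model.wv.most_similar(target, topn=k)]

def intersection_synset(models, target, k=30):
    """Intersect the top-k result lists of several word embeddings models.
    Words whose ranking position varies strongly across models drop out (cf. Table 3)."""
    if any(target not in m.wv for m in models):
        return []
    lists = [top_k(m, target, k) for m in models]
    keep = set.intersection(*(set(lst) for lst in lists))
    return [w for w in lists[0] if w in keep]        # preserve the ranking order of the first model

def evaluate(models, eval_thesaurus, k=30):
    """Accumulate TP, FP and FN over all target words of the evaluation thesaurus
    and return precision, recall and F1 as defined above."""
    tp = fp = fn = 0
    for target, gold in eval_thesaurus.items():      # {target word: set of gold synonyms}
        suggested = set(intersection_synset(models, target, k)) - {target}
        tp += len(suggested & gold)
        fp += len(suggested - gold)
        fn += len(gold - suggested)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```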
In Figure 9, we present the precision/recall curves for successively more intersection steps. The points labeled with 20 represent the precision/recall values comparing the evaluation thesauri with a fixed number of results for a target word. This is equivalent to an approach using a fixed synset size and serves as a baseline. The other precision/recall data points are calculated by intersecting the results from the models with 20, 19, 18, etc. iterations. For I=19, two lists are intersected, obtained from the models with 20 and 19 iterations. Subsequently, result lists from one additional word embeddings model are included per data point. All used models were calculated during the parameter study described in Section 5.

The more result lists are intersected, the better the precision. The increasing precision reflects the opinion of experts that intersected result lists are much better than fixed-length synsets calculated by word2vec and cosine similarity. The recall drops slightly, which can be expected, because the more lists are intersected, the fewer entries remain in the final result lists, and a suitable trade-off needs to be chosen. The overall low values of the precision stem from the large number of false positives, cf. Table 5, i.e. results obtained by the word2vec approach that are not present in the thesaurus. Remember that a manually maintained thesaurus is not complete. For example, many spelling errors in the corpus are not reflected in the thesaurus. Since the creation of a high-quality thesaurus by humans is difficult and expensive, it cannot be expected that all sensible synonym relations for a huge corpus are present in the thesaurus. We therefore show real results obtained using the intersection method for two exemplary target words from our training corpus in Table 4 (one tax law-specific example and one general example that is not specific to the legal domain).

Table 4: The synsets returned by our intersection method for the target words Abwrackprämie (scrapping bonus) and Arzt (doctor). Here, the intersection of the result lists calculated from five different word embeddings models on our training corpus has been carried out (word embeddings models with iterations parameter I=20 to I=15 from the parameter study).

Tax law-specific example:
Abwrackprämie (scrapping bonus): Abwrackhilfe (scrapping help), Umweltprämie (environmental bonus), Prämie (bonus)

General example (unspecific to legal domain):
Arzt (doctor): Zahnarzt (dentist), Chefarzt (chief physician), Facharzt (consulting physician), Rechtsanwalt (lawyer), Assistenzarzt (assistant doctor), Oberarzt (senior physician), Tierarzt (veterinarian), Kassenarzt (preferred provider), Laborarzt (laboratory physician), Krankenhausarzt (hospital physician)

The precision values are very low, but an optimization of the precision values is not the goal here. The relevant measure is the change in precision (and recall). Also, the parameter k (the length of the intermediate result lists) needs to be chosen manually. We conducted a parameter study on the parameter k, see Figure 10. The recall drops quite a lot in the beginning, but then stabilizes for k>30 while the precision continues to increase. However, the larger the individual result lists become, the more computing resources are necessary to calculate the intersections. Again, a good trade-off needs to be chosen.

Table 5: The detailed F1-Score (F1), precision (P) and recall (R) values for the results depicted in Figure 9 for N=250. For I=20 no intersections have been calculated (baseline); for I=15 intersections of the result lists from five different word embeddings models have been computed. True positives (TP), false positives (FP) and false negatives (FN) values are listed. The precision values are very small because of the high number of false positives, which reflects that word2vec returns many more words than a human-made thesaurus contains. Note that high precision and recall values are not the goal nor necessary for this evaluation, but the relative improvement. In fact, it is desired to find additional relevant words for inclusion in the thesaurus.

CBOW
I   F1     P      R      TP   FP     FN
20  0.051  0.027  0.543  475  17135  399
19  0.060  0.032  0.538  470  14376  404
18  0.064  0.034  0.535  468  13246  406
17  0.067  0.036  0.532  465  12515  409
16  0.069  0.037  0.524  458  12002  416
15  0.070  0.038  0.519  454  11599  420
14  0.072  0.039  0.517  452  11267  422
13  0.073  0.039  0.514  449  10963  425
12  0.075  0.040  0.513  448  10687  426

Skip-gram
I   F1     P      R      TP   FP     FN
20  0.052  0.027  0.553  483  17127  391
19  0.060  0.032  0.538  470  14212  404
18  0.065  0.034  0.530  463  12974  411
17  0.067  0.036  0.522  456  12198  418
16  0.069  0.037  0.515  450  11653  424
15  0.072  0.038  0.514  449  11232  425
14  0.073  0.039  0.508  444  10852  430
13  0.075  0.040  0.505  441  10506  433
12  0.076  0.041  0.497  434  10167  440

Figure 9: The precision/recall curves show promising results for our intersection method. The upper graph shows the results for the CBOW model, the lower graph for the Skip-gram model. In both graphs, each data point is labeled with the iterations parameter (I) that the corresponding - additionally included - word embeddings model has been calculated with. For I=20 no intersection between result lists for the words of the synset has been calculated yet. For I=19 the intersection of the result lists from I=20 and I=19 has been computed. Subsequently, more and more result lists are intersected from the word embeddings models with I=18 to I=12 iterations. More intersections lead to significantly increased precision while the recall only drops slightly. Please note the description for Table 5 regarding the low precision values. Best viewed in color.
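Building on the previous sketch, the successive intersections behind Figure 9 and Table 5 and the study of the intermediate list length k (Figure 10) could be reproduced roughly as follows. The model file names are hypothetical and assume one saved model per iterations setting; eval_thesaurus is the mapping assumed in the previous sketch.

```python
from gensim.models import Word2Vec

# One saved word embeddings model per iterations setting, I = 20 down to I = 12 (hypothetical names).
models = [Word2Vec.load("model_iter_%d.w2v" % i) for i in range(20, 11, -1)]

# Figure 9 / Table 5: successively intersect more result lists; I=20 alone is the fixed-length baseline.
for n in range(1, len(models) + 1):
    precision, recall, f1 = evaluate(models[:n], eval_thesaurus, k=30)
    print("models intersected: %d  P=%.3f  R=%.3f  F1=%.3f" % (n, precision, recall, f1))

# Figure 10: vary the length k of the intermediate result lists for a fixed selection of models.
for k in (10, 20, 30, 40, 50, 100):
    print(k, evaluate(models[:6], eval_thesaurus, k=k))
```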
8. DISCUSSION & LIMITATIONS
Building synsets by intersecting result lists from word embeddings models calculated with different parameters is a good step towards an automated creation or extension of thesauri. However, there are several limitations to the overall process. An open question that remains is the selection of the different word embeddings models that are used to calculate the intersection lists. An efficient approach for creating different word embeddings models would be to dump word embeddings models at checkpoints over the iterations parameter: after each iteration, a word embeddings model could be saved to disk. However, the word2vec algorithm does not support check-pointing (i.e. resuming training after stopping the algorithm) out of the box. Other implementations, such as gensim, do support check-pointing. In our experience, the intersections from word embeddings models calculated with varying more than one parameter give better results. This might be due to larger variation among word embeddings models that differ in multiple parameters (for example, iterations and vector size). However, we did not evaluate this in a structured fashion. Also, we did not evaluate the results with experts in a structured way. This is an important next step for the future.

Another issue for the automated creation of thesauri from scratch is that with the intersection method we can calculate synsets for given words, but we did not tackle the problem of selecting the target words for which synsets are to be calculated. The word2vec implementation comes with a clustering algorithm that can be used to identify different clusters of synonymous words. However, the question of which of the clusters can be considered relevant remains. For a practical application in a search context, one starting point could be to find synsets for the words in the most frequent search queries from the users. Search query expansion methods based on word embeddings could overtake the manual creation of thesauri in many cases. However, the manual creation of a thesaurus allows for more manual control over search results.

Moreover, our approach sometimes has interesting effects on the meaning of individual words. For example, the result list for the word doctor yields a list of mostly different types of doctors (dentist, chief of medicine, ENT physician and the like, where the corresponding German compound words are closed compound words). In contrast to this, the result list for the word physician returns all sorts of different job types, for example lawyer, teacher and the like. We also did not include antonyms or a discrimination of synonyms and other types of relations in thesauri.

9. CONCLUSION & FUTURE WORK
We investigated an approach to extending thesauri for a large German tax law text corpus based on word embeddings, in particular with word2vec. These newly detected relations can be presented to experts for possible inclusion in the thesaurus. We use a large existing thesaurus, manually maintained specifically for this corpus, to identify good parameters for word2vec. We introduced a novel intersection method that intersects result lists of related words calculated by word2vec and cosine similarity for given target words. We showed that the intersection method returns synonym sets with higher F1-score and precision. The intersection approach provides an elegant solution to mitigate the problems associated with fixed-length or threshold-based approaches to decide on synset sizes.

Figure 10: We evaluate the size of the result lists that are the input to our intersection method. Up to k=40 the recall drops and stabilizes for k>40, while precision and F1-Score increase steadily. A decent number of final synonyms that are presented to humans to consider for inclusion in the thesaurus is desirable. Therefore, we suggest using a value of k around 30 to 40 for this particular corpus and thesaurus and can report a 50 percent improvement in F1-Score. Larger values of k lead to even larger percental improvements. Best viewed in color.

For the future, an interesting question is to understand and quantify the impact of extending the corpus before entering the word2vec algorithm. It remains unclear why certain resulting synonym sets encompass certain specific meanings (doctors are related to different types of doctors and physicians are related to other professions). Related to that, a word sense disambiguation could significantly improve the quality of the resulting synonym sets. We plan to include label propagation approaches based on word embeddings and to evaluate results with experts. Furthermore, a feasible solution to deal with open compound words (n-grams) and the automated selection of target words (words that synsets will be calculated for) are important. Different similarity measures, investigated for distributional thesauri methods, for example by Bullinaria and Levy 2007 [2], could be investigated. Finally, it will be necessary to evaluate the suitability of additionally suggested synonym relations by our approach with humans.

Note that the word2vec implementation is deterministic in
the sense that the exact same corpus and parameter selection results in identical word embeddings models. In our experience, abbreviations tend to give very bad 10. ACKNOWLEDGMENTS results using our approach. We believe that using existing The authors thank DATEV eG for providing the dataset abbreviation lists is the better option in this case. Besides and all people involved for their support. abbreviations, open compound words (terms consisting of multiple words) are problematic, too. Calculating all 11. REFERENCES combinations of multiple words is very time- and resource-intensive for large corpora. One solution is to [1] G. B. Adebayo Kolawole John, Luigi Di Caro and convert known open compound words as single tokens C. Bartolini. - an approach to information retrieval before entering them into the word2vec algorithm. Open and question answering in the legal domain. In compound words can be detected, for example, using the Proceedings of the 10th International Workshop on phrase2vec algorithm and implementation that ships with Juris-informatics (JURISIN 2016), 2016. the word2vec implementation. [2] J. A. Bullinaria and J. P. Levy. Extracting semantic So far, we did not use any sufficiently reliable word sense representations from word co-occurrence statistics: A disambiguation algorithm in our approach (such as computational study. Behavior Research Methods, POS-tagging). Hence, each token can map only to one 39(3):510–526, Aug 2007. vector for all meanings. Moreover, our approach sometimes [3] C. Dirschl. Thesaurus generation and usage at wolters has interesting effects on the meaning of individual words. kluwer germany. IRIS: Internationales For example, the result list to the word doctor yields a list Rechtsinformatik Symposium, Salzburg, Austria, 2016. [4] D. Freitag, M. Blume, J. Byrnes, E. Chow, J. Dean. Distributed representations of words and S. Kapadia, R. Rohwer, and Z. Wang. New phrases and their compositionality. In C. J. C. Burges, experiments in distributional representations of L. Bottou, M. Welling, Z. Ghahramani, and K. Q. synonymy. In Proceedings of the Ninth Conference on Weinberger, editors, Advances in Neural Information Computational Natural Language Learning, pages Processing Systems 26, pages 3111–3119. Curran 25–32. Association for Computational Linguistics, Associates, Inc., 2013. 2005. [18] K. A. Nguyen, S. Schulte im Walde, and N. T. Vu. [5] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word Integrating distributional lexical contrast into word embedding based generalized language model for embeddings for antonym-synonym distinction. In information retrieval. In Proceedings of the 38th Proceedings of the 54th Annual Meeting of the International ACM SIGIR Conference on Research Association for Computational Linguistics (Volume 2: and Development in Information Retrieval, SIGIR ’15, Short Papers), pages 454–459, Berlin, Germany, 2016. pages 795–798, New York, NY, USA, 2015. ACM. Association for Computational Linguistics. [6] Y. Goldberg. A primer on neural network models for [19] M. Ono, M. Miwa, and Y. Sasaki. Word natural language processing. Journal of Artificial embedding-based antonym detection using thesauri Intelligence Research, 57:345–420, 2016. and distributional information. In R. Mihalcea, J. Y. [7] M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, Chai, and A. Sarkar, editors, HLT-NAACL, pages C. Wang, E. Nyberg, and V. R. Walker. Introducing 984–989. 
The Association for Computational luima: an experiment in legal conceptual retrieval of Linguistics, 2015. vaccine injury decisions using a uima type system and [20] N. Ould Amer, P. Mulhem, and M. Géry. Toward tools. In Proceedings of the 15th International Word Embedding for Personalized Information Conference on Artificial Intelligence and Law, pages Retrieval. In Neu-IR: The SIGIR 2016 Workshop on 69–78. ACM, 2015. Neural Information Retrieval, volume abs/1606.06991, [8] G. Grefenstette. Explorations in automatic thesaurus Pisa, Italy, July 2016. discovery, volume 278. Springer Science & Business [21] J. Pennington, R. Socher, and C. D. Manning. Glove: Media, 1994. Global vectors for word representation. In EMNLP, [9] Z. S. Harris. Distributional structure. Word, volume 14, pages 1532–1543, 2014. 10(2-3):146–162, 1954. [22] P. Resnik. Wordnet and distributional analysis: A [10] J. Karlgren and M. Sahlgren. From words to class-based approach to lexical discovery. In AAAI understanding. Foundations of Real-World workshop on statistically-based natural language Intelligence, pages 294–308, 2001. processing techniques, pages 56–64, 1992. [11] D. Kiela, F. Hill, and S. Clark. Specializing word [23] M. Riedl and C. Biemann. Scaling to large3 data: An embeddings for similarity or relatedness. In efficient and effective method to compute L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and distributional thesauri. In Conference on Empirical Y. Marton, editors, EMNLP, pages 2044–2048. The Methods on Natural Language Processing, pages Association for Computational Linguistics, 2015. 884–890, 2013. [12] J. Landthaler, B. Waltl, P. Holl, and F. Matthes. [24] S. Rothe and H. Schütze. Autoextend: Extending Extending full text search for legal document word embeddings to embeddings for synsets and collections using word embeddings. In F. Bex and lexemes. In Proceedings of the 53rd Annual Meeting of S. Villata, editors, Legal Knowledge and Information the Association for Computational Linguistics and the Systems - JURIX 2016: The Twenty-Ninth Annual 7th International Joint Conference on Natural Conference, volume 294 of Frontiers in Artificial Language Processing (Volume 1: Long Papers), pages Intelligence and Applications, pages 73–82. IOS Press, 1793–1803, Beijing, China, July 2015. Association for 2016. Computational Linguistics. [13] O. Levy and Y. Goldberg. Neural word embedding as [25] H. Rubenstein and J. B. Goodenough. Contextual implicit matrix factorization. In Z. Ghahramani, correlates of synonymy. Communications of the ACM, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 8(10):627–633, Oct. 1965. Weinberger, editors, Advances in Neural Information [26] P. Rychlý and A. Kilgarriff. An efficient algorithm for Processing Systems 27, pages 2177–2185. Curran building a distributional thesaurus (and other sketch Associates, Inc., 2014. engine developments). In Proceedings of the 45th [14] Q. Lu and J. G. Conrad. Next generation legal Annual Meeting of the ACL on Interactive Poster and search-it’s already here. Vox Populii blog, Legal Demonstration Sessions, ACL ’07, pages 41–44, Information Institute (LII), Cornell University, 2013. Stroudsburg, PA, USA, 2007. Association for [15] S. McDonald and M. Ramscar. Testing the Computational Linguistics. distributional hypothesis: The influence of context on [27] K. Sparck Jones. Synonymy and semantic judgements of semantic similarity. In Proceedings of classification. 
PhD thesis, University of Edinburgh, the 23rd annual conference of the cognitive science 1964. society, pages 611–616, 2001. [28] H. Zamani and W. B. Croft. Embedding-based query [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. language models. In Proceedings of the 2016 ACM Efficient estimation of word representations in vector International Conference on the Theory of space. CoRR, abs/1301.3781, 2013. Information Retrieval, ICTIR ’16, pages 147–156, New [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and York, NY, USA, 2016. ACM.