Extending Thesauri Using Word Embeddings and the Intersection Method

Jörg Landthaler, Bernhard Waltl, Dominik Huth, Daniel Braun and Florian Matthes
Software Engineering for Business Information Systems, Department for Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching bei München, Germany
joerg.landthaler@tum.de, b.waltl@tum.de, dominik.huth@tum.de, daniel.braun@tum.de, matthes@in.tum.de

Christoph Stocker and Thomas Geiger
Department for Portals and Collaboration (EM45), DATEV eG, Fürther Straße 111, 90329 Nürnberg, Germany
christoph.stocker@datev.de, thomas.geiger@datev.de

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT
In many legal domains, the amount of available and relevant literature is continuously growing. Legal content providers face the challenge to provide their customers relevant and comprehensive content for search queries on large corpora. However, documents written in natural language contain many synonyms and semantically related concepts. Legal content providers usually maintain thesauri to discover more relevant documents in their search engines. Maintaining a high-quality thesaurus is an expensive, difficult and manual task. The word embeddings technology recently gained a lot of attention for building thesauri from large corpora. We report our experiences on the feasibility of extending thesauri based on a large corpus of German tax law with a focus on synonym relations. Using a simple yet powerful new approach, called the intersection method, we can significantly improve and facilitate the extension of thesauri.

Keywords
thesaurus, synsets, word embeddings, word2vec, parameter study, intersection method, tax law

1. INTRODUCTION
Legal content providers offer their customers access to large amounts of different text documents. Clients expect a search engine that serves all relevant search results in an ordered manner with the most relevant results at the top. The expectation of users encompasses that all relevant documents are returned, which remains a major task in information retrieval and receives much attention, also in the legal domain, cf. Qiang and Conrad [14] or Grabmair et al. [7].

One possibility is the employment of thesauri to capture the ambiguous nature of natural language. Thesauri capture binary relations such as synonyms or antonyms, and some thesauri additionally cover hierarchical relationships. Large general-purpose thesauri have been built, for example WordNet (Fellbaum, 1998). Thesauri can be used for search query expansion to increase the recall of information retrieval systems and particularly to include documents that use synonymous words. Maintaining a thesaurus is expensive and error-prone, especially for large thesauri, see for example Dirschl [3]. In the area of computational linguistics, the automated creation of thesauri has been investigated since the 1950s, cf. Section 2. The distributional hypothesis claims that words that share contexts likely have a more similar meaning (as perceived by humans) than others. Since 2013 there has been an increasing interest in a technology called word embeddings that combines machine learning (ML) technologies with the distributional hypothesis. In contrast to distributional thesauri calculated based on statistical evaluations, the relatedness of words is calculated in a softer, iterative fashion and is easy to access using the cosine similarity measure.

In this paper we investigate the applicability of the word embeddings technology, in particular the word2vec implementation [16], to support humans in extending an existing thesaurus. While the overall goal is to find new relevant synonym relations that can be suggested to humans to consider for inclusion in the existing thesaurus,
the one focus of this paper is how word embeddings can be trained such that the quality of the word embeddings is good. The basic assumption is that high-quality word embeddings will lead to better suggestions for synonym relations that are not present in the current thesaurus. Related use cases are the creation of thesauri from scratch or the automated merging with third-party thesauri. Moreover, an unsolved problem is to determine only relevant synonyms given a word, i.e. to build sensible synonym sets (synsets). Most approaches need to resort to a fixed amount of relevant words or to rely on the identification of a suitable threshold of relatedness. We investigate a straightforward approach to identify semantically closed synsets without resorting to unreliable thresholds for a large corpus of German tax law and report our experiences. Using a given thesaurus that is manually maintained specifically for this corpus, we conduct parameter studies for the different parameters of word2vec. We propose and evaluate a novel approach to intersect result lists of a relatedness ranking of all words in the vocabulary of the corpus. Multiple word2vec word embeddings models are calculated with different parameters. For a given word (target word), we calculate the relatedness ranking of all words in the corpus and intersect the lists of the first top results among the word embeddings models calculated with different word2vec parameters. We can report promising results of our evaluation of the intersection method with the given corpus and the corresponding manually maintained thesaurus.

The remainder of this work is organized as follows: In Section 2 we give a brief summary of automatic thesauri generation and related work. In Section 3 we give an overview of the corpus and the corresponding manually maintained thesaurus used for all our experiments. The word embeddings technology is introduced in Section 4. We study the different word2vec parameters in Section 5 and present our intersection list method in Section 6. We evaluate the novel intersection method in Section 7 and discuss limitations in Section 8. Finally, Section 9 summarizes our findings and discusses future work.

2. RELATED WORK
The manual creation of thesauri is a very labor-intensive process. There have been various attempts to automate the process. A popular approach emerged from the distributional hypothesis formulated by Harris in 1954 [9]. The distributional hypothesis claims that words with similar or related meanings tend to occur in similar contexts. The hypothesis is supported by many studies [25, 15, 10, 22].

In 1964, Sparck Jones [27] used the distributional hypothesis to automatically create thesauri using count-based methods. Many followed this approach, for example Grefenstette [8]. Thesauri are useful for several natural language processing problems and much effort has been put into improving distributional thesauri. Rychlý and Kilgarriff [26] developed a system called SketchEngine that efficiently generates distributional thesauri from large datasets. They pre-process the dataset and remove word pairs that have nothing in common before the actual calculation is performed. Hence, their algorithm can process a dataset with 2 billion words in less than 2 hours (compared to 300 days without the removal). In their project JoBimText, Riedl and Biemann [23] use a parallelized approach based on MapReduce and a Pointwise Mutual Information (PMI) measure to improve calculation speed as well as the quality of the generated thesaurus.

Word embeddings can be seen as an evolution of distributional statistics enhanced with machine learning approaches. Traditional distributional thesauri are calculated based on co-occurrence counts, while word embeddings leverage sub-sampling methods that are heavily used in the machine learning domain. Word embeddings provide easy access to word relatedness via the cosine similarity measure. Kiela et al. [11] proposed that during the training phase, word embeddings can be pushed in a particular direction and optimized for detecting relatedness. On the TOEFL synonym task, Freitag et al. [4] report considerably better results than for non-specialized embeddings. Thesauri often not only contain synonyms, but also antonyms. Ono et al. [19] presented an approach to detect antonyms, using word embeddings and distributional information. Nguyen et al. [18] improve the discrimination of antonyms and synonyms by integrating lexical contrast into their vector representation. In 2015, Rothe and Schütze [24] presented AutoExtend, an extension of word embeddings optimized to train embedding vectors for synonym sets (one vector per synset) and their composing lexemes. To the best of our knowledge and in contrast to our intersection method, all approaches use fixed-length result sets or fixed thresholds to build synsets.

Recently, word embeddings have been used for query expansion in information retrieval directly, i.e. without the creation of knowledge bases. Examples for such query expansion using word embeddings are Ganguly et al., Zamani et al. and Amer et al. [5, 28, 20]. Query expansion using word embeddings specialized for the legal domain has recently been proposed by Adebayo et al. [1].

3. DATASET & PRE-PROCESSING
We conduct our parameter studies and the evaluation of our intersection method for extending thesauri on a legal text corpus provided by our industry partner DATEV eG. The corpus comprises different document types on the topic of German tax law, cf. Figure 1. The corpus of 150 million pre-processed tokens yields a vocabulary of circa 180.000 entries. Our industry partner manually maintains a high-quality thesaurus specifically for this corpus including approximately 12.000 synonym sets with around 36.000 terms.

Figure 1: The used text corpus comprises different document types on the topic of German tax law with a total amount of approximately 130.000 documents. The corpus comprises roughly 150 million tokens yielding a vocabulary size of circa half a million words (not pre-processed).

Table 1: The occurrence frequency of a token in the corpus has a strong impact on the quality of the resulting embeddings for individual words. Therefore, we choose different evaluation selections of synonym sets from the given thesaurus - which is specifically maintained for this corpus - based on the minimal individual term frequency of the tokens; N is defined as the minimum occurrence frequency of each individual token in the corpus to be included in a synset in an evaluation thesaurus.

N     Synsets  Terms  Terms/Group  Relations
250   260      587    2.26         874
625   125      289    2.31         464
1000  84       195    2.32         312

We pre-process both the corpus and the thesaurus. The corpus is pre-processed such that we include only tokens with more than four characters, remove all special characters and punctuation and lowercase all tokens. A single line with all cleaned and white-space-separated tokens is entered into the word2vec algorithm. For the thesaurus, we additionally restrict ourselves to terms that consist of a single token. It is well known that the occurrence frequency of words is crucial to the quality of the resulting word embeddings.
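The pre-processing described above can be summarized in a short sketch. This is a minimal illustration rather than the pipeline actually used for the experiments: the file layout, file names and the exact character filtering (here letters, German umlauts and hyphens) are assumptions.

```python
import re
from collections import Counter
from pathlib import Path

def clean_tokens(text):
    """Lowercase, drop punctuation/special characters, keep only tokens longer than four characters."""
    text = re.sub(r"[^a-zäöüß\-\s]", " ", text.lower())   # character whitelist is an assumption
    return [tok for tok in text.split() if len(tok) > 4]

def build_corpus_file(input_dir, output_file):
    """Concatenate all cleaned, white-space-separated tokens into a single line, as fed to word2vec."""
    tokens = []
    for path in sorted(Path(input_dir).glob("*.txt")):     # hypothetical plain-text export of the corpus
        tokens.extend(clean_tokens(path.read_text(encoding="utf-8")))
    Path(output_file).write_text(" ".join(tokens), encoding="utf-8")
    return Counter(tokens)                                  # token frequencies, reused below

def evaluation_synsets(synsets, frequencies, n_min):
    """Restrict the thesaurus to single-token terms occurring at least n_min times (cf. Table 1)."""
    selected = []
    for synset in synsets:
        terms = [t for t in synset if " " not in t and frequencies.get(t, 0) >= n_min]
        if len(terms) >= 2:                                 # keep only synsets with at least one relation
            selected.append(terms)
    return selected
```

Calling evaluation_synsets with n_min set to 250, 625 and 1000 would yield the three evaluation thesauri of Table 1.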
We extracted three different evaluation sets where all words in the evaluation thesaurus occur at least N = {250, 625, 1000} times in the corpus, see Table 1.

All experiments have been carried out on an Intel Core i5-2500 (4x2333MHz) machine with 8 GB DDR3-1333 RAM running Ubuntu 14.04, Python 2.7.6, Numpy 1.13.1, scipy 0.16.1 and word2vec 0.1c (only versions ≥ 0.1c support the iterations parameter).

4. WORD EMBEDDINGS
Word embeddings are a family of algorithms producing dense vectors that represent words in the vocabulary of a corpus. The word embeddings can be trained using the Continuous Bag-of-words (CBOW) or Skip-gram training models depicted in Figure 3. Word embeddings combine the distributional hypothesis with artificial neural networks [13]. Due to new efficient methods of calculating word embeddings introduced by Mikolov et al. [16], word embeddings for several gigabytes of text data can be calculated within hours. While word embeddings are still considered a bag-of-words approach, they do encode the general context of words in dense vectors. Mathematical operations, for example vector addition, can be carried out on the vectors while preserving their inherent semantic characteristics. Mikolov et al. [17] show that word embeddings trained on fictional English literature capture semantic relationships among words. We illustrate such semantic relationships encoded in word embeddings in Figure 2. We noticed that relevant characteristics are recognizable even for word embeddings trained on comparably very small training corpora [12], at least regarding text similarity tasks. Hence, we assume that our corpus with 150 million tokens is large enough to produce word embeddings with sufficient quality.

Figure 2: Illustration of the characteristics of word embeddings in two-dimensional space [12]. The angle α between the two vectors for the words freedom and obligation is larger than the angle β, which reflects that the two words duty and obligation are semantically more related than duty and freedom or obligation and freedom.

Figure 3: Continuous Bag-of-words (CBOW, CB) and Skip-gram (SG) training model illustrations adapted from [16].

The word2vec algorithm implements both training models. The basic idea behind the two training models is that either a word is used to predict its context or, the other way around, the context is used to predict the current word. This task is iterated word by word over the corpus. The prediction can be thought of as a classifier from machine learning. The vectors are collected from the weights of the artificial neural network that serves as a classifier. Hence, large amounts of text can be used in an unsupervised fashion to train word embeddings.

Next, we give a short summary of a selection of the most important implementations to calculate word embeddings:

• word2vec: The original C++ implementation of word2vec by Mikolov et al. [16], [17] is very fast. It provides a multi-threaded implementation, but it does not support check-pointing, i.e. resuming computations after stopping (https://github.com/kzhai/word2vec, word2vec version 0.1c for MAC OS X, accessed on 22/January/2017).

• gensim: The gensim implementation of word2vec provides a Python interface to calculate word embeddings and supports check-pointing (https://github.com/nicholas-leonard/word2vec, accessed on 22/January/2017).

• Apache Spark: Apache Spark includes a Java/Scala implementation of word2vec that can be run in a Hadoop cluster (https://spark.apache.org/, accessed on 22/January/2017).

• deeplearning4j: This Java implementation is embedded in the general deeplearning4j framework for machine learning and is similar to gensim, but implemented for usage with Java and Scala (https://deeplearning4j.org/word2vec.html, accessed on 22/January/2017).

• GloVe: GloVe vectors, presented by Pennington et al. [21], are a count-based algorithm implemented in C. GloVe vectors are similar to the word2vec embeddings, but optimize a different objective function (http://nlp.stanford.edu/projects/glove/, accessed on 22/January/2017).

• Torch: Torch is, similar to the well-known Tensorflow framework, a deep learning and scientific computing framework that provides a Lua interface to a word2vec implementation (https://github.com/yoonkim/word2vec_torch, accessed on 22/January/2017).

• Tensorflow: Tensorflow is a deep learning framework provided by Google and offers (among others) a Python interface to its own word2vec implementation (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/tutorials/word2vec, accessed on 22/January/2017).

Figure 4: Illustration of our overall approach: word2vec parameters need to be chosen. Our intersection method can be applied after the (repeated) calculation of (fixed length) synsets.
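To illustrate the two training models and the cosine-similarity based relatedness ranking, the following sketch trains a CBOW and a Skip-gram model with the gensim implementation listed above. The parameter names follow recent gensim releases and the corpus file name and example target word are assumptions; the experiments in this paper were run with the original word2vec 0.1c tool.

```python
from pathlib import Path
from gensim.models import Word2Vec

# Pre-processed corpus (cf. Section 3), fed to gensim as fixed-size chunks of the single token line.
tokens = Path("corpus_preprocessed.txt").read_text(encoding="utf-8").split()
sentences = [tokens[i:i + 1000] for i in range(0, len(tokens), 1000)]

common = dict(
    vector_size=300,   # dimensionality of the vectors ("size" in the word2vec C tool)
    window=5,          # context window around the current word
    min_count=0,       # keep a vector for every token that occurs in the corpus
    negative=5,        # number of negative samples
    epochs=20,         # iterations over the corpus ("iter" in older versions)
    workers=4,
)

cbow = Word2Vec(sentences, sg=0, **common)       # CBOW: the context predicts the current word
skipgram = Word2Vec(sentences, sg=1, **common)   # Skip-gram: the current word predicts its context

# Relatedness ranking via cosine similarity for an example target word from our corpus:
print(cbow.wv.most_similar("abwrackprämie", topn=10))
```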
5. WORD2VEC PARAMETERS
The word2vec implementation has a large number of parameters. For the most relevant parameters, we conducted a parameter study using a manually maintained thesaurus as the ground truth for an evaluation of the quality of the resulting word embeddings. While a thesaurus by nature cannot be perfectly sharp, we assume that relations identified by humans are sufficiently reliable and that, by using a large number of such relations, we can identify general trends. The following list gives an overview of the most important parameters:

• Size (Dimensionality): The size of the resulting vectors is chosen manually. From an information entropy point of view this value needs to be large enough so that all relevant information can be encoded in the vectors. However, the larger the vector size is chosen, the more computationally expensive training and subsequent calculations become.

• Window Size: The window size is a training parameter that defines the size of the context window around each word during the training phase. Arguing from a statistical linguistics point of view, large (word-)distances between two words (for example in two consecutive sentences) usually lead to less influence of the words on each other [6].

• Iterations (I): The iterations parameter defines the number of iterations over the full corpus and can be thought of as an artificial enlargement of the corpus. The choice for this parameter heavily depends on the corpus size. A larger number of iterations is particularly useful for small corpora.

• Minimum Count: The occurrence frequency of individual tokens has a strong impact on the quality of the resulting word embeddings. Using the minimum count parameter, one can control that words occur sufficiently often in the corpus. The downside of this parameter is that words in the vocabulary that do not occur often enough in the corpus will not have a vector.

• Alpha: The initial learning rate is a parameter that is derived from artificial neural network training and is not investigated, because we assume that it is very specific to a concrete dataset.

• Negative: The number of negative examples presented during the training. Consult [6] for a comprehensive explanation.

• CBOW or Skip-gram: The two possible training models that can be used to train word embeddings with word2vec. Either the current word is used to predict the context of the current word or, vice versa, the context is used to predict the current word, cf. Figure 3. In our experiments the CBOW model produces high-quality word embeddings in less training time.

We assume that the vector size, window size, negative and iterations parameters are the most important parameters for the creation or extension of a thesaurus given a corpus. We set the minimum count parameter to zero, because we want a vector for each word present in the corpus.

The manually maintained thesaurus for our corpus contains groups of synonymously used words. A simple example for such a group of synonyms is (lawsuit, case, dispute). We are interested in how the vector size, window size, iterations and negative parameters affect similarity score lists calculated based on word embeddings trained by word2vec. We introduce the RP-Score measure that measures the average positions of synonyms in the ranking lists of the terms. The RP-Score provides a measure to compare the relatedness of two words obtained from humans and obtained by our word2vec approach. In contrast to the mean reciprocal rank (MRR), the RP-Score is not normalized to 1 and provides a more intuitive understanding of the quality of a word embeddings model.

Figure 5: The cosine similarity distance measure is symmetric between (the word embeddings vectors of) two words. The ranking positions of two words can be asymmetric, because other words can be closer to one of the two words, but more distant to the other word, cf. Table 2.

Table 2: For our parameter study we use a measure that we call RP-Score. Using the cosine similarity measure, we compute an ordered list of the relatedness of all words in the vocabulary to one selected word from the vocabulary (target word). For three target words, the five most related words are listed. The word dispute has the ranking position (RP) 4 for the target word lawsuit. The RP-Score for this example synset is (1+4+2+5+3+4)/6 ≈ 3.16.

RP  lawsuit     case        dispute
1   case        trial       controversial
2   law-suit    lawsuit     trial
3   litigation  litigation  lawsuit
4   dispute     law-suit    case
5   trial       dispute     litigation

The RP-Score measure is calculated as follows: For all target words in a synset we calculate a sorted list of all words in the vocabulary using the cosine similarity measure. The ranking list is ordered such that the most related words are at the top of the list. We determine the position for each combination of two words in a synset and accumulate all ranking positions. We perform this calculation on all synsets, divide the value per synset by the total number of relations among the words in that synset, and aggregate over all synsets. RP_{w_1}(w_2) is defined as the position of w_2 in the result list of w_1:

RPScore := \sum_{s \in \text{synsets}} \frac{\sum_{w_1, w_2 \in s,\; w_1 \neq w_2} RP_{w_1}(w_2)}{|s| \, (|s| - 1)}

Note that the RP-Scores are not a sharp measure in our evaluation. A thesaurus maintained by humans can lack (and contain) similarity relations that are included (or ignored) by an unsupervised word embeddings calculation. For example, spelling mistakes are often not contained in a thesaurus but are detected by our overall approach. Wrongly typed words can be good candidates for inclusion in a thesaurus, because documents like judgments cannot be corrected. Nevertheless, we are able to observe reasonable results in our parameter studies, due to the averaging of the RP-Score over a larger number of evaluation synsets.

For all parameters investigated, we applied both the CBOW and the Skip-gram model. For all experiments (unless otherwise stated) we used a vector size of 300 and ran 20 iterations. For the window size parameter study, the window parameter runs from 1 to 20. For the vector size parameter study, we increase the vector size from 100 to 700 in steps of 100, plus one run with vector size 50. For the number of iterations, we run 1 to 20 iterations incrementally and only selected samples up to 100 iterations, cf. Figures 6 and 7. For the negative (sampling size) parameter we consecutively vary the value between 1 and 10. For each parameter study, we calculate the RP-Scores using the three different evaluation thesauri described in Section 3 for each computed word embeddings model.

Figure 6: Parameter studies of different relevant word2vec parameters: window size, vector size, negative samples and number of iterations. For all parameters the computing costs increase with larger parameter values (cf. Figure 8) and a trade-off between computing costs and quality is inevitable. We measure the impact of the different word2vec parameters on the quality of the resulting word embeddings using the RP-Score with a thesaurus that is manually and specifically maintained for the given text corpus. Good parameter choices have a reasonably small RP-Score. For example, the default window size of 5 from the word2vec implementation is a good choice already. Note that the CBOW training model results in better RP-Scores while having smaller runtime values for equal parameter choices. Best viewed in color.
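A direct, if slow, implementation of the RP-Score defined in this section could look as follows. It is a sketch under the assumption that a trained gensim model (cf. the sketch in Section 4) is available and that synsets are given as lists of single tokens.

```python
def ranking_positions(model, target, others):
    """1-based positions of `others` in the cosine-similarity ranking of all words for `target`."""
    ranking = model.wv.most_similar(target, topn=len(model.wv))   # full, ordered result list
    index = {word: pos for pos, (word, _score) in enumerate(ranking, start=1)}
    return [index.get(other, len(model.wv)) for other in others]  # fall back to worst rank if missing

def rp_score(model, synsets):
    """RP-Score: per synset, sum all pairwise ranking positions, divide by |s| * (|s| - 1),
    and aggregate over all synsets."""
    total = 0.0
    for synset in synsets:
        terms = [t for t in synset if t in model.wv]
        if len(terms) < 2:
            continue
        pair_sum = 0
        for w1 in terms:
            pair_sum += sum(ranking_positions(model, w1, [w2 for w2 in terms if w2 != w1]))
        total += pair_sum / (len(terms) * (len(terms) - 1))
    return total

# The example synset (lawsuit, case, dispute) from Table 2 contributes (1+4+2+5+3+4) / 6 ≈ 3.16.
```

A production implementation would cache the full ranking per target word, since computing it for every pair dominates the runtime.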
Figure 7: We evaluated the word2vec iterations parameter for larger values: Reasonable quality of the word embeddings can - for this corpus - be obtained with I=20, but slight improvements, especially for individual evaluation thesaurus and training parameter choices, can be achieved with larger iteration parameter choices. Note that for a larger number of iterations we conducted fewer runs, hence the perceived relative smoothness for I>20 is not representative. Best viewed in color.

Figure 8: The computational costs in terms of runtime for the different parameter selections shown in Figure 6 increase linearly for all investigated parameter values (for the linearly increased parameter values from the parameter study). The calculation of reasonably good word embeddings models takes around one hour for the given dataset on the machine described in Section 3. Note that not all word embeddings models have been calculated in a sequential order of time. Thus, we can exclude cache effects.

6. INTERSECTION METHOD
Finding good parameters for the word2vec algorithm is an important step towards the creation or extension of thesauri. However, several challenges remain. A ranking with respect to the similarity score of all words in the vocabulary is not enough. One major challenge is to decide which words should be included in a synset and which should not. One possibility is to include a fixed number of related words according to the ranking list, for example the first ten. Another possibility is to define certain thresholds for the similarity score between two words. However, the similarity scores differ substantially among the result lists for different target words. For example, the first related result in a ranking list for the word law might have a similarity score of 0.7, while the first result for the word doctor could have a similarity score of 0.4, although both are considered true synonyms by humans.

During our experiments with the parameters of word2vec, we recognized that the result lists differ substantially from one word2vec parameter selection to another. This led us to the idea of calculating intersections of the result lists for target words. This approach has two advantages at once. First, we do not need to define a threshold or a fixed number of words to form a synset for a given word. Moreover, we experience that bad results are filtered out because their positions in the ranking lists vary a lot. Table 3 illustrates the intersection approach. From a different point of view, for a specific word, words that are always at the top of different ranking lists form a fix-point set that is stable and, as we will show in Section 7, returns higher quality results. We also found that the resulting intersection lists have different sizes, which reflects that the approach identifies sensible synset sizes. For example, specific words like scrapping bonus result in few words in the intersection lists, while words like doctor yield a large number of synonyms in the intersection lists. This behavior reflects the human perception of good synsets, too.

Table 3: Artificial example to illustrate our intersection method. We calculate the k=5 most related words from the word embeddings models for the word car. The left column shows the most related words to the target word car for the word embeddings model calculated with 20 iterations. The center column shows the most related words to the word car for the word embeddings model calculated with 19 iterations. The right column displays the intersection of the left and center columns. Irrelevant words (for example animal or chair) are dropped from the final result list by the intersection, because they tend to change their positions heavily among result lists from word embeddings models calculated with different parameters.

Iter.=20    Iter.=19    Intersection
car         car         car
auto        automobile  auto
automobile  chair       automobile
animal      sheep
doctor      egg
tree        auto

The intersection of result lists that span the full vocabulary would again result in lists of the size of the vocabulary. Hence, we use a fixed number of the first k entries of a result list as an intermediate result. This additional parameter serves as an upper bound for the size of the returned synsets. Due to the strong variation among result lists obtained from different word embeddings models, this upper limit is not reached most of the time.

7. EVALUATION
Our evaluation is an attempt to quantify the perception of humans that synsets obtained by the intersection method have higher quality than synsets obtained using thresholds. We compare precision/recall values calculated per synset and accumulate the individual results. We use precision (P), recall (R) and F1-Score (F1) defined as follows:

P := TP / (TP + FP)    R := TP / (TP + FN)    F1 := 2 * (P * R) / (P + R)

True positives (TP) are present in the evaluation thesaurus and in the result of our method. False positives (FP) are present in the result of our method, but not in the synset of the evaluation thesaurus. False negatives (FN) are present in the synset of the evaluation thesaurus, but not in the result lists. First, we use k=30 as the intermediate result list length.
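The intersection method of Section 6 and the per-target-word accumulation of TP, FP and FN described above can be sketched as follows. The sketch assumes a list of trained gensim models and an evaluation thesaurus given as a mapping from a target word to its set of gold synonyms; k bounds the intermediate result lists.

```python
def top_k(model, target, k=30):
    """The k most related words to `target` by cosine similarity (the intermediate result list)."""
    return [word for word, _score in model.wv.most_similar(target, topn=k)]

def intersection_synset(models, target, k=30):
    """Intersect the top-k result lists of several word embeddings models.
    Words whose ranking position varies strongly across models drop out (cf. Table 3)."""
    if any(target not in m.wv for m in models):
        return []
    lists = [top_k(m, target, k) for m in models]
    keep = set.intersection(*(set(lst) for lst in lists))
    return [w for w in lists[0] if w in keep]        # preserve the ranking order of the first model

def evaluate(models, eval_thesaurus, k=30):
    """Accumulate TP, FP and FN over all target words of the evaluation thesaurus
    and return precision, recall and F1 as defined above."""
    tp = fp = fn = 0
    for target, gold in eval_thesaurus.items():      # {target word: set of gold synonyms}
        suggested = set(intersection_synset(models, target, k)) - {target}
        tp += len(suggested & gold)
        fp += len(suggested - gold)
        fn += len(gold - suggested)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```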
In Figure 9, we present the precision/recall curves for successively more intersection steps. The points labeled with 20 represent the precision/recall values comparing the evaluation thesauri with a fixed number of results for a target word. This is equivalent to an approach using a fixed synset size and serves as a baseline. The other precision/recall data points are calculated by intersecting the results from the models with 20, 19, 18, etc. iterations. For I=19, two lists are intersected, obtained from the models with 20 and 19 iterations. Subsequently, result lists from one additional word embeddings model are included per data point. All used models were calculated during the parameter study described in Section 5.

The more result lists are intersected, the better the precision. The increasing precision reflects the opinion of experts that intersected result lists are much better than fixed-length synsets calculated by word2vec and cosine similarity. The recall drops slightly, which can be expected, because the more lists are intersected, the fewer entries remain in the final result lists, and a suitable trade-off needs to be chosen. The overall low values of the precision stem from the large number of false positives, cf. Table 5, i.e. results obtained by the word2vec approach that are not present in the thesaurus. Remember that a manually maintained thesaurus is not complete. For example, many spelling errors in the corpus are not reflected in the thesaurus. Since the creation of a high-quality thesaurus by humans is difficult and expensive, it cannot be expected that all sensible synonym relations for a huge corpus are present in the thesaurus. We therefore show real results obtained using the intersection method for two exemplary target words from our training corpus in Table 4 (one tax law-specific example and one general example that is not specific to the legal domain).

Table 4: The synsets returned by our intersection method for the target words Abwrackprämie (scrapping bonus) and Arzt (doctor). Here, the intersection of the result lists calculated from five different word embeddings models on our training corpus has been carried out (word embeddings models with iterations parameter I=20 to I=15 from the parameter study).

Tax law-specific example:
Abwrackprämie (scrapping bonus): Abwrackhilfe (scrapping help), Umweltprämie (environmental bonus), Prämie (bonus)

General example (unspecific to legal domain):
Arzt (doctor): Zahnarzt (dentist), Chefarzt (chief physician), Facharzt (consulting physician), Rechtsanwalt (lawyer), Assistenzarzt (assistant doctor), Oberarzt (senior physician), Tierarzt (veterinarian), Kassenarzt (preferred provider), Laborarzt (laboratory physician), Krankenhausarzt (hospital physician)

The precision values are very low, but an optimization of the precision values is not the goal here. The relevant measure is the change in precision (and recall). Also, the parameter k (the length of the intermediate result lists) needs to be chosen manually. We conducted a parameter study on the parameter k, see Figure 10. The recall drops quite a lot in the beginning, but then stabilizes for k>30 while the precision continues to increase. However, the larger the individual result lists become, the more computing resources are necessary to calculate the intersections. Again, a good trade-off needs to be chosen.

Table 5: The detailed F1-Score (F1), precision (P) and recall (R) values for the results depicted in Figure 9 for N=250. For I=20 no intersections have been calculated (baseline); for I=15 intersections of the result lists from five different word embeddings models have been computed. True positives (TP), false positives (FP) and false negatives (FN) values are listed. The precision values are very small because of the high number of false positives, which reflects that word2vec returns many more words than a human-made thesaurus contains. Note that high precision and recall values are not the goal nor necessary for this evaluation, but the relative improvement. In fact, it is desired to find additional relevant words for inclusion in the thesaurus.

CBOW
I   F1     P      R      TP   FP     FN
20  0.051  0.027  0.543  475  17135  399
19  0.060  0.032  0.538  470  14376  404
18  0.064  0.034  0.535  468  13246  406
17  0.067  0.036  0.532  465  12515  409
16  0.069  0.037  0.524  458  12002  416
15  0.070  0.038  0.519  454  11599  420
14  0.072  0.039  0.517  452  11267  422
13  0.073  0.039  0.514  449  10963  425
12  0.075  0.040  0.513  448  10687  426

Skip-gram
I   F1     P      R      TP   FP     FN
20  0.052  0.027  0.553  483  17127  391
19  0.060  0.032  0.538  470  14212  404
18  0.065  0.034  0.530  463  12974  411
17  0.067  0.036  0.522  456  12198  418
16  0.069  0.037  0.515  450  11653  424
15  0.072  0.038  0.514  449  11232  425
14  0.073  0.039  0.508  444  10852  430
13  0.075  0.040  0.505  441  10506  433
12  0.076  0.041  0.497  434  10167  440

Figure 9: The precision/recall curves show promising results for our intersection method. The upper graph shows the results for the CBOW model, the lower graph for the Skip-gram model. In both graphs, each data point is labeled with the iterations parameter (I) that the corresponding - additionally included - word embeddings model has been calculated with. For I=20 no intersection between result lists for the words of the synset has been calculated yet. For I=19 the intersection of the result lists from I=20 and I=19 has been computed. Subsequently, more and more result lists are intersected from the word embeddings models with I=18 to I=12 iterations. More intersections lead to significantly increased precision while the recall only drops slightly. Please note the description for Table 5 regarding the low precision values. Best viewed in color.
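Building on the previous sketch, the successive intersections behind Figure 9 and Table 5 and the study of the intermediate list length k (Figure 10) could be reproduced roughly as follows. The model file names are hypothetical and assume one saved model per iterations setting; eval_thesaurus is the mapping assumed in the previous sketch.

```python
from gensim.models import Word2Vec

# One saved word embeddings model per iterations setting, I = 20 down to I = 12 (hypothetical names).
models = [Word2Vec.load("model_iter_%d.w2v" % i) for i in range(20, 11, -1)]

# Figure 9 / Table 5: successively intersect more result lists; I=20 alone is the fixed-length baseline.
for n in range(1, len(models) + 1):
    precision, recall, f1 = evaluate(models[:n], eval_thesaurus, k=30)
    print("models intersected: %d  P=%.3f  R=%.3f  F1=%.3f" % (n, precision, recall, f1))

# Figure 10: vary the length k of the intermediate result lists for a fixed selection of models.
for k in (10, 20, 30, 40, 50, 100):
    print(k, evaluate(models[:6], eval_thesaurus, k=k))
```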
8. DISCUSSION & LIMITATIONS
Building synsets by intersecting result lists from word embeddings models calculated with different parameters is a good step towards an automated creation or extension of thesauri. However, there are several limitations to the overall process. An open question that remains is the selection of the different word embeddings models that are used to calculate the intersection lists. An efficient approach for creating different word embeddings models would be to dump word embeddings models at checkpoints over the iterations parameter: after each iteration, a word embeddings model could be saved to disk. However, the word2vec algorithm does not support check-pointing (i.e. resuming training after stopping the algorithm) out of the box. Other implementations, such as gensim, do support check-pointing. In our experience, the intersections from word embeddings models calculated with varying more than one parameter give better results. This might be due to larger variation among word embeddings models that differ in multiple parameters (for example, iterations and vector size). However, we did not evaluate this in a structured fashion. Also, we did not evaluate the results with experts in a structured way. This is an important next step for the future.

Another issue for the automated creation of thesauri from scratch is that with the intersection method we can calculate synsets for given words, but we did not tackle the problem of selecting the target words for which synsets are to be calculated. The word2vec implementation comes with a clustering algorithm that can be used to identify different clusters of synonymous words. However, the question of which of the clusters can be considered relevant remains. For a practical application in a search context, one starting point could be to find synsets for the words in the most frequent search queries from the users. Search query expansion methods based on word embeddings could overtake the manual creation of thesauri in many cases. However, the manual creation of a thesaurus allows for more manual control over search results.

Moreover, our approach sometimes has interesting effects on the meaning of individual words. For example, the result list for the word doctor yields a list of mostly different types of doctors (dentist, chief of medicine, ENT physician and the like, where the corresponding German compound words are closed compound words). In contrast to this, the result list for the word physician returns all sorts of different job types, for example lawyer, teacher and the like. We also did not include antonyms or a discrimination of synonyms and other types of relations in thesauri.

9. CONCLUSION & FUTURE WORK
We investigated an approach to extending thesauri for a large German tax law text corpus based on word embeddings, in particular with word2vec. These newly detected relations can be presented to experts for possible inclusion in the thesaurus. We use a large existing thesaurus, manually maintained specifically for this corpus, to identify good parameters for word2vec. We introduced a novel intersection method that intersects result lists of related words calculated by word2vec and cosine similarity for given target words. We showed that the intersection method returns synonym sets with higher F1-score and precision. The intersection approach provides an elegant solution to mitigate the problems associated with fixed-length or threshold-based approaches to decide on synset sizes.

Figure 10: We evaluate the size of the result lists that are the input to our intersection method. Up to k=40 the recall drops and stabilizes for k>40, while precision and F1-Score increase steadily. A decent number of final synonyms that are presented to humans to consider for inclusion in the thesaurus is desirable. Therefore, we suggest using a value of k around 30 to 40 for this particular corpus and thesaurus and can report a 50 percent improvement in F1-Score. Larger values of k lead to even larger percental improvements. Best viewed in color.

For the future, an interesting question is to understand and quantify the impact of extending the corpus before entering the word2vec algorithm. It remains unclear why certain resulting synonym sets encompass certain specific meanings (doctors are related to different types of doctors and physicians are related to other professions). Related to that, a word sense disambiguation could significantly improve the quality of the resulting synonym sets. We plan to include label propagation approaches based on word embeddings and to evaluate results with experts. Furthermore, a feasible solution to deal with open compound words (n-grams) and the automated selection of target words (words that synsets will be calculated for) are important. Different similarity measures, investigated for distributional thesauri methods, for example by Bullinaria and Levy 2007 [2], could be investigated. Finally, it will be necessary to evaluate the suitability of additionally suggested synonym relations by our approach with humans.

Note that the word2vec implementation is deterministic in
the sense that the exact same corpus and parameter selection results in identical word embeddings models. In our experience, abbreviations tend to give very bad 10. ACKNOWLEDGMENTS results using our approach. We believe that using existing The authors thank DATEV eG for providing the dataset abbreviation lists is the better option in this case. Besides and all people involved for their support. abbreviations, open compound words (terms consisting of multiple words) are problematic, too. Calculating all 11. REFERENCES combinations of multiple words is very time- and resource-intensive for large corpora. One solution is to [1] G. B. Adebayo Kolawole John, Luigi Di Caro and convert known open compound words as single tokens C. Bartolini. - an approach to information retrieval before entering them into the word2vec algorithm. Open and question answering in the legal domain. In compound words can be detected, for example, using the Proceedings of the 10th International Workshop on phrase2vec algorithm and implementation that ships with Juris-informatics (JURISIN 2016), 2016. the word2vec implementation. [2] J. A. Bullinaria and J. P. Levy. Extracting semantic So far, we did not use any sufficiently reliable word sense representations from word co-occurrence statistics: A disambiguation algorithm in our approach (such as computational study. Behavior Research Methods, POS-tagging). Hence, each token can map only to one 39(3):510–526, Aug 2007. vector for all meanings. Moreover, our approach sometimes [3] C. Dirschl. Thesaurus generation and usage at wolters has interesting effects on the meaning of individual words. kluwer germany. IRIS: Internationales For example, the result list to the word doctor yields a list Rechtsinformatik Symposium, Salzburg, Austria, 2016. [4] D. Freitag, M. Blume, J. Byrnes, E. Chow, J. Dean. Distributed representations of words and S. Kapadia, R. Rohwer, and Z. Wang. New phrases and their compositionality. In C. J. C. Burges, experiments in distributional representations of L. Bottou, M. Welling, Z. Ghahramani, and K. Q. synonymy. In Proceedings of the Ninth Conference on Weinberger, editors, Advances in Neural Information Computational Natural Language Learning, pages Processing Systems 26, pages 3111–3119. Curran 25–32. Association for Computational Linguistics, Associates, Inc., 2013. 2005. [18] K. A. Nguyen, S. Schulte im Walde, and N. T. Vu. [5] D. Ganguly, D. Roy, M. Mitra, and G. J. Jones. Word Integrating distributional lexical contrast into word embedding based generalized language model for embeddings for antonym-synonym distinction. In information retrieval. In Proceedings of the 38th Proceedings of the 54th Annual Meeting of the International ACM SIGIR Conference on Research Association for Computational Linguistics (Volume 2: and Development in Information Retrieval, SIGIR ’15, Short Papers), pages 454–459, Berlin, Germany, 2016. pages 795–798, New York, NY, USA, 2015. ACM. Association for Computational Linguistics. [6] Y. Goldberg. A primer on neural network models for [19] M. Ono, M. Miwa, and Y. Sasaki. Word natural language processing. Journal of Artificial embedding-based antonym detection using thesauri Intelligence Research, 57:345–420, 2016. and distributional information. In R. Mihalcea, J. Y. [7] M. Grabmair, K. D. Ashley, R. Chen, P. Sureshkumar, Chai, and A. Sarkar, editors, HLT-NAACL, pages C. Wang, E. Nyberg, and V. R. Walker. Introducing 984–989. 
The Association for Computational luima: an experiment in legal conceptual retrieval of Linguistics, 2015. vaccine injury decisions using a uima type system and [20] N. Ould Amer, P. Mulhem, and M. Géry. Toward tools. In Proceedings of the 15th International Word Embedding for Personalized Information Conference on Artificial Intelligence and Law, pages Retrieval. In Neu-IR: The SIGIR 2016 Workshop on 69–78. ACM, 2015. Neural Information Retrieval, volume abs/1606.06991, [8] G. Grefenstette. Explorations in automatic thesaurus Pisa, Italy, July 2016. discovery, volume 278. Springer Science & Business [21] J. Pennington, R. Socher, and C. D. Manning. Glove: Media, 1994. Global vectors for word representation. In EMNLP, [9] Z. S. Harris. Distributional structure. Word, volume 14, pages 1532–1543, 2014. 10(2-3):146–162, 1954. [22] P. Resnik. Wordnet and distributional analysis: A [10] J. Karlgren and M. Sahlgren. From words to class-based approach to lexical discovery. In AAAI understanding. Foundations of Real-World workshop on statistically-based natural language Intelligence, pages 294–308, 2001. processing techniques, pages 56–64, 1992. [11] D. Kiela, F. Hill, and S. Clark. Specializing word [23] M. Riedl and C. Biemann. Scaling to large3 data: An embeddings for similarity or relatedness. In efficient and effective method to compute L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and distributional thesauri. In Conference on Empirical Y. Marton, editors, EMNLP, pages 2044–2048. The Methods on Natural Language Processing, pages Association for Computational Linguistics, 2015. 884–890, 2013. [12] J. Landthaler, B. Waltl, P. Holl, and F. Matthes. [24] S. Rothe and H. Schütze. Autoextend: Extending Extending full text search for legal document word embeddings to embeddings for synsets and collections using word embeddings. In F. Bex and lexemes. In Proceedings of the 53rd Annual Meeting of S. Villata, editors, Legal Knowledge and Information the Association for Computational Linguistics and the Systems - JURIX 2016: The Twenty-Ninth Annual 7th International Joint Conference on Natural Conference, volume 294 of Frontiers in Artificial Language Processing (Volume 1: Long Papers), pages Intelligence and Applications, pages 73–82. IOS Press, 1793–1803, Beijing, China, July 2015. Association for 2016. Computational Linguistics. [13] O. Levy and Y. Goldberg. Neural word embedding as [25] H. Rubenstein and J. B. Goodenough. Contextual implicit matrix factorization. In Z. Ghahramani, correlates of synonymy. Communications of the ACM, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. 8(10):627–633, Oct. 1965. Weinberger, editors, Advances in Neural Information [26] P. Rychlý and A. Kilgarriff. An efficient algorithm for Processing Systems 27, pages 2177–2185. Curran building a distributional thesaurus (and other sketch Associates, Inc., 2014. engine developments). In Proceedings of the 45th [14] Q. Lu and J. G. Conrad. Next generation legal Annual Meeting of the ACL on Interactive Poster and search-it’s already here. Vox Populii blog, Legal Demonstration Sessions, ACL ’07, pages 41–44, Information Institute (LII), Cornell University, 2013. Stroudsburg, PA, USA, 2007. Association for [15] S. McDonald and M. Ramscar. Testing the Computational Linguistics. distributional hypothesis: The influence of context on [27] K. Sparck Jones. Synonymy and semantic judgements of semantic similarity. In Proceedings of classification. 
PhD thesis, University of Edinburgh, the 23rd annual conference of the cognitive science 1964. society, pages 611–616, 2001. [28] H. Zamani and W. B. Croft. Embedding-based query [16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. language models. In Proceedings of the 2016 ACM Efficient estimation of word representations in vector International Conference on the Theory of space. CoRR, abs/1301.3781, 2013. Information Retrieval, ICTIR ’16, pages 147–156, New [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and York, NY, USA, 2016. ACM.