<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Vecsigrafo: Portable Semantics in Knowledge-based Text Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ronald Denaux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Manuel Gomez-Perez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Expert System</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The proliferation of knowledge graphs and recent advances in Artificial Intelligence have raised great expectations related to the combination of symbolic and distributional semantics in cognitive tasks. This is particularly the case for knowledge-based approaches to natural language processing, as near-human symbolic understanding and explanation rely on expressive structured knowledge representations that tend to be labor-intensive, brittle and biased. This paper reports research addressing such limitations by capturing as embeddings the semantics of both words and concepts in large document corpora. We show how the emerging knowledge representation, our Vecsigrafo, can drive semantic portability capabilities that are not easily achieved by either word embeddings or knowledge graphs on their own, supporting curation, overcoming modeling gaps, and enabling interlinking and multilingualism. In doing so, we also share our experiences and lessons learned and propose new methods that provide insight on the quality of such embeddings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For several decades, semantic systems have been predominantly developed around
knowledge graphs (KGs) (and their variants: semantic networks and
ontologies) at different degrees of expressivity. Language technologies, particularly
knowledge-based text analytics, have heavily relied on such structured
knowledge. Through the explicit representation of knowledge in well-formed, logically
sound ways, KGs provide rich, expressive and actionable descriptions of the
domain of interest through logical deduction and inference, and support
logical explanations of reasoning outcomes. On the downside, KGs can be costly
to produce, leading to scalability issues, as they require a considerable amount
of well-trained human labor [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to manually encode knowledge in the required
formats. Capturing the knowledge from the crowd has been suggested [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], but
scalability is proportional to the number of humans contributing to such tasks.
Furthermore, the involvement of humans in the modelling activities introduces a
bias in how the domain is represented (its depth, breadth and focus), which can
lead to brittle systems that only work in a limited context, hinder generality,
and may require continuous supervision and curation.
      </p>
      <p>In parallel, the last decade has witnessed a shift towards statistical
methods due to the increasing availability of raw data and cheap computing power.
Statistical approaches to text understanding have proved powerful and
convenient in many linguistic tasks, such as part-of-speech tagging and dependency
parsing. However, these methods are also limited and cannot be
considered a replacement for knowledge-based text analytics; e.g. humans seek
causal explanations, which are hard to provide with statistical methods, as
these are driven by statistical induction rather than logical deduction.</p>
      <p>
        Recent results in the field of distributional semantics [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have shown
promising ways to learn features from text that can complement the knowledge already
captured explicitly in KGs. Embeddings provide a compact and portable
representation of words and their associated semantics that stems directly from a
document corpus. Here, the notion of semantic portability refers to the
capability to capture, as an information artifact (a vector), the semantics of a word from
its occurrences in the corpus, and to how such an artifact enables that meaning to be
merged with other forms of (possibly structured) knowledge representation.
      </p>
      <p>At Expert System, we make extensive use of formal knowledge
representations, including semantic networks and linguistic rule bases, as these
technologies have shown higher accuracy figures when properly fine-tuned, are more resilient
to scarce training data and tend to offer better inspection capabilities,
allowing us to debug and adapt when necessary. However, this comes at the cost of
considerable human effort by linguists and knowledge engineers for various KG
curation tasks: continuous bug fixing, keeping resources up to date (e.g. Barack
Obama is/was the president of the US), and adding extensions for new domains
or terms of interest (Cybersecurity, Blockchain) for each language supported.
We argue that semantic portability is a key feature to facilitate KG-curation
tasks, deal with modeling bias, and enable interlinking of KGs.</p>
      <p>In this paper, we present research and experiences evaluating, adopting and
adapting approaches for generating and applying word and concept embeddings.
Among others, we argue that using embeddings in practice (other than as inputs
for further deep learning systems) is not trivial, as there is a lack of evaluation
tools and best practices other than manual inspection, which we aim to
minimize in the first place. Therefore, we describe some methods we have
developed to check our intuitions about these systems and establish reproducible
good practices. Our main contributions are: i) a novel method for the generation
of semantically portable, joint word and concept embeddings and their
applications in hybrid knowledge-based text analytics (Sect. 3), ii) inspection and
evaluation methods that have proved useful for assessing the quality and fitness
of embeddings for the purpose of our research (Sect. 4). This paper also applies
the embedding generation to Expert System's case, resulting in a Vecsigrafo, and
describes a practical application for the Vecsigrafo (Sect. 5).</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>NLP systems which perform knowledge-based text analysis rely on a Knowledge
Graph (KG) as their point of reference for performing analyses. Good KGs for text
analysis represent concepts and entities, their semantic and grammatical
relations, and their lexical forms, enabling the system to recognise and disambiguate
those concepts. KGs used in practice include DBpedia, WordNet and BabelNet.</p>
      <p>The Knowledge Graph is used by a text analysis engine which performs
NLP tasks such as tokenization, part-of-speech tagging, etc. Furthermore, many
text analysis engines can use the knowledge encoded in the KG to perform word
sense disambiguation (WSD), in which particular senses are ascribed to words
in the text. This can be done for simple cases, such as knowing whether apple
refers to the company or the fruit; but it extends to more subtle differences, such
as distinguishing between redeem as "exchanging a voucher", as "paying off a
debt", or in its religious sense. The sense disambiguation results can then be used
to improve further NLP tasks such as categorization and information extraction;
this can be done using either machine learning or rule-based approaches.</p>
      <p>At Expert System our KG is called Sensigrafo1; it relates concepts
(internally called syncons) to each other via an extensible set of relations (e.g.
hypernym, synonym, meronym), to their lemmas (base forms of verbs, nouns,
etc.) and to a set of around 400 topic domains (e.g. clothing, biology, law,
electronics). For historical, strategic and organizational reasons we maintain different
Sensigrafos for the 14 languages we support natively; we have partial mappings
at the conceptual level between some (but not all) of our Sensigrafos, as
producing and maintaining these mappings requires prohibitive amounts of human
effort. The Sensigrafo is used by Cogito, our text analysis engine, which
performs word sense disambiguation with an accuracy close to 90% for languages
with mature native support. On top of Cogito we have rule languages which
allow us to write (or learn) custom categorization and extraction rules based on
the linguistic characteristics (including disambiguated concepts) of documents.</p>
      <sec id="sec-2-1">
        <title>Word and KG embeddings</title>
        <p>
          Various approaches for statistical, corpus-based models of semantic
representations have been proposed over the last two decades [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Semantic representations were traditionally encoded
in sparse pointwise mutual information matrices; recent approaches instead generate
embeddings: dense, low-dimensional spaces that capture similar information to
the pointwise mutual information matrices. In particular, the word2vec system
based on the skip-gram with negative-sampling (SGNS) algorithm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] provides
an efficient way to generate high-quality word embeddings from large corpora.
        </p>
        <p>
          Although SGNS is defined in terms of sequences of words, subsequent
algorithms such as GloVe [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and Swivel [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] have shown that similar (or better)
results can be achieved by learning the vectors from sparse co-occurrence
matrices. These approaches make it easier to generalise over different types of linguistic
units (e.g. concepts and words), as we show in Sect. 3; this is harder to do with
SGNS, since it expects non-overlapping sequences of linguistic units.
        </p>
        <p>
          Standard corpus-based word embeddings do not encode KG-concepts due
to the ambiguity of words in natural language. One approach to resolve this is to
generate sense embeddings [
          <xref ref-type="bibr" rid="ref10 ref5">5,10</xref>
          ], whereby tags are added to the words in the
corpus to indicate the sense and part-of-speech of each word. While this addresses the
ambiguity of individual words, the resulting embeddings do not directly provide
embeddings for KG-concepts, only for various synonymous word-sense pairs2.
1 You can think of Sensigrafo as a highly curated version of WordNet.
        </p>
        <p>
          Approaches have also been proposed for learning concept embeddings from
existing Knowledge Graphs [
          <xref ref-type="bibr" rid="ref16 ref4 ref8">4,8,16</xref>
          ]. Compared to corpus-based embeddings, we
find that KG-derived embeddings are not yet as useful for our purposes because:
(i) KG embeddings encode knowledge which is relatively sparse (compared to
large text corpora); (ii) the original KG is already structured and is easy to query
and inspect; (iii) corpus-based models provide a bottom-up view of the linguistic
units and reflect how language is used in practice, as opposed to KGs, which
provide a top-down view, as they have usually been created by human experts (e.g.
Sensigrafo, which has been hand-curated by linguists). Other proposed methods
can generate joint embeddings of words and KG entities [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. However, they mix
bottom-up and top-down views3, which we want to avoid. Hence, in Section 3,
we propose a corpus-based, joint word-concept embedding generation method.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Existing Methods and Practices for Evaluating Embeddings</title>
        <p>
          One evaluation method used in the literature and in open-source tools relies on
manual inspection: papers provide examples listing the top-n most similar words for
a given input word to show semantic clustering. Similarly, visual inspection is
provided in the form of t-SNE or PCA dimensionality-reduction projections into
two dimensions, which also show semantic clustering; an example of a tool that
can be used for this purpose is the Embeddings Projector [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], distributed as
part of TensorFlow. However, these tools restrict the view to the neighbourhood
of a single point or area of the embedding space and make it hard to understand
the overall behaviour of the embeddings. In Sections 4.2 and 4.3 we introduce a
plot which addresses this issue and show concrete applications.
        </p>
        <p>
          Intrinsic evaluation methods are used to try to understand the overall
quality of embeddings. In the case of word embeddings, a few papers employ
such methods to provide systematic evaluations of various models [
          <xref ref-type="bibr" rid="ref12 ref2">2,12</xref>
          ]. As part
of these evaluations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] defines 5 types of tasks (and lists available test sets) that
compare embedding predictions to human-rated datasets. Of the five identified
types, two (semantic relatedness and analogy) are consistently used in the recent
literature, while the other three (synonym detection, concept categorization and
selectional preference) are rarely used. Such intrinsic evaluations define specific,
somewhat artificial tasks which are not the end goals of the embeddings. Typically,
embeddings are generated to be used within larger (typically machine-learning
based) NLP systems [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]; improvements on those NLP systems by using the
embeddings provide extrinsic evaluations, which are rarely mentioned in the
literature. Schnabel et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] show that good results in intrinsic evaluations
do not guarantee better extrinsic results. In Sections 4 and 5 we present both
intrinsic and extrinsic evaluations.
2 E.g. the word-sense pairs apple-N2 and Malus pumila-N1 have separate embeddings, but the
concept for apple tree that they represent has no embedding.
3 By aligning a TransE-like knowledge model and an SGNS-like text model.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Corpus-based Joint Concept-Word Embeddings</title>
      <p>In order to build hybrid systems which can use both bottom-up (corpus-based)
embeddings and top-down (KG) knowledge, we propose to generate embeddings
which share the same vocabulary as the Knowledge Graph. This means
generating embeddings for knowledge items represented in the KG, such as concepts and
surface forms (words and expressions) associated with the concepts in the KG4.</p>
      <p>
        The overall process for learning joint word and concept embeddings is
depicted in Figure 1. We start with a text corpus, to which we apply
tokenization and word sense disambiguation (WSD) to generate a disambiguated corpus:
a sequence of lexical entries (words or multi-word expressions). Some
of the lexical entries are annotated with a particular sense (concept) formalised
in the KG. To generate embeddings for both senses and lexical entries, we need
to correctly handle lexical entries which are associated with a sense in the KG,
hence we extend the matrix construction phase of the Swivel [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] algorithm to
generate a co-occurrence matrix which includes both lexical forms and senses as
part of the vocabulary, as explained below. Then we apply the training phase of
a slightly modified version of the Swivel algorithm to learn the embeddings for
the vocabulary; the modification is the addition of a vector regularization term
as suggested in [
as suggested in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (Equation 5), which aims to reduce the distance between the
column and row (i.e. focus and context) vectors for all vocabulary elements.
      </p>
      <p>
        Modified Swivel Co-occurrence Matrix Construction. The main
modification from standard Swivel5 is that in our case each token in the corpus is not
a single word, but a lexical entry with an optional KG-concept annotation. Both
lexical entries and KG-concepts need to be taken into account when calculating
the co-occurrence matrix. Formally, the co-occurrence matrix X ∈ R^(|V|×|V|) contains
the co-occurrence counts found over a corpus, where the vocabulary V = L ∪ C is
the union of lexical forms L and KG-concepts C. X_ij = #(v_i, v_j) is
the frequency with which lexical entries or concepts v_i and v_j co-occur within a
certain window size w. Note that X_ij ∈ R, since this enables us to use a dynamic
context window [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], weighting the co-occurrence of tokens according to their
distance within the sequences6.
4 In RDF, this typically means values for rdfs:label properties, or words and
expressions encoded as ontolex:LexicalEntry instances using the lexicon model for
ontologies (see https://www.w3.org/2016/05/ontolex/).
5 As implemented in https://github.com/tensorflow/models/tree/master/swivel
6 We use a modified harmonic function h(n) = 1/n for n &gt; 0 and h(0) = 1, which
covers the case where a token has both a lexical form and a concept. This is the
same weighting function used in GloVe and Swivel; word2vec uses a slightly different
function d(n) = n/w.
      </p>
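<p>As a rough illustration of this counting scheme, the weighted counts over the joint vocabulary V = L ∪ C can be accumulated as follows. This is a simplified sketch, not the sharded TensorFlow implementation; the (lemma, concept) token layout is our own simplification of the disambiguated corpus:</p>

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=4):
    """Count weighted co-occurrences over a disambiguated corpus.

    `corpus` is a list of sentences; each token is a (lemma, concept) pair,
    where `concept` may be None when WSD found no KG sense. Both the lemma
    and (when present) the concept enter the vocabulary, so the resulting
    matrix covers V = L ∪ C.
    """
    # h(n) = 1/n for n > 0 and h(0) = 1: a token's lemma and its own
    # concept sit at distance 0, so they co-occur with weight 1.
    def h(n):
        return 1.0 if n == 0 else 1.0 / n

    counts = defaultdict(float)
    for sentence in corpus:
        for i, focus in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                w = h(abs(i - j))
                # expand each position into its vocabulary elements
                for u in [t for t in focus if t is not None]:
                    for v in [t for t in sentence[j] if t is not None]:
                        if i != j or u != v:  # skip self-pairs only
                            counts[(u, v)] += w
    return counts
```

<p>Note that at distance 0 the lemma and its annotated concept co-occur with weight 1, which is the case footnote 6 covers.</p>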
      <sec id="sec-3-1">
        <title>Vecsigrafo Generation at Expert System</title>
        <p>
          We follow the process described in Fig. 1, adapting it to Expert System's
technology stack (described in Sect. 2). As input we used the UN corpus [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], specifically the alignment between the English and Spanish
corpora, which has about 22 million lines per language. We used Cogito to perform
tokenization, lemmatization and word-sense disambiguation. We use tokens at the
disambiguation level; this means that some tokens correspond to single words, while
others correspond to multi-word expressions (when they can be related to a Sensigrafo
concept). Furthermore, we only kept lemmas (or base forms) as our lexical entries,
since the Sensigrafo only associates concepts with lemmas, not the various
morphological forms in which they can appear in natural language; this reduces
the size of the vocabulary. We also filter the tokens by removing stopwords.
We trained two vecsigrafos, for Spanish and English, for 80 epochs. The resulting
vecsigrafos are summarised and compared to the corresponding Sensigrafos in
Table 1. As the table shows, the UN corpus only covers between 20 and 34% of
the lemmas and concepts in the respective Sensigrafos.
Table 1: Size of vocabularies (×1000) for English (En-grafo) and Spanish (Es-grafo) in Sensi- and Vecsigrafos
Vocab Element    En-grafo: Sensi- / Vecsi-    Es-grafo: Sensi- / Vecsi-
Lemmas           398 / 80                     268 / 91
KG-Concepts      300 / 67                     226 / 52
Total            698 / 147                    474 / 143
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluating Vecsigrafos</title>
      <p>
        In general, we find that generating embeddings is relatively easy using and
adapting existing tools. However, evaluating the quality of the resulting
embeddings is not as straightforward. We have used both manual and visual inspection
tools, but ran into the issues discussed in Sect. 2.2. In particular, the Embeddings
Projector [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] is limited to displaying 10K points, hence we can only visualize a
part of the vecsigrafo at a time. By combining information from Sensigrafo and
exploring areas of the space with this projector, we have been able to find some
vocabulary elements which were "out of context"; this was typically caused either
by a limitation of the corpus or by issues with our language processing pipeline
(e.g. tokenization or disambiguation), which could then be further investigated.
      </p>
      <sec id="sec-4-1">
        <title>Semantic Relatedness</title>
        <p>Testing on various semantic relatedness datasets gives us results which are well below
the state of the art, as shown in Table 2. Part of this is due to the corpus used,
which is smaller and more domain-restricted than other corpora (e.g. Wikipedia,
Gigaword): when we apply standard Swivel on the UN corpus we see a
substantial decrease in performance. Furthermore, the lemmatization and inclusion of
concepts in the vocabulary may introduce noise and warp the vector space,
negatively impacting the results for this specific task.</p>
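<p>For reference, such relatedness tests follow the usual protocol: the Spearman correlation between embedding cosine similarities and human similarity ratings. A minimal sketch (the dataset layout and the skipping of out-of-vocabulary pairs are our assumptions, not a description of a specific benchmark harness):</p>

```python
import numpy as np
from scipy.stats import spearmanr

def relatedness_score(embeddings, pairs):
    """Spearman correlation between cosine similarities and human ratings.

    `embeddings` maps a lemma to a vector; `pairs` is a list of
    (word1, word2, human_rating) triples, e.g. from a dataset such as
    WordSim-353. Pairs with out-of-vocabulary words are skipped.
    """
    sims, golds = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            sims.append(cos)
            golds.append(gold)
    return spearmanr(sims, golds).correlation
```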
        <p>Although these results are disappointing, we note that (i) the results only
provide information about the quality of the lemma embeddings, not the
concept embeddings; and (ii) an easy way to improve these results is to train a
Vecsigrafo on a larger corpus. Also, most of the relevant datasets are only available
for English; we could not find similar datasets for Spanish.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Word Prediction Plots</title>
        <p>To address the limitations of top-n neighbour, visual exploration and semantic
relatedness approaches discussed above, we developed a method to generate plots
that can give us an overview of how well the embeddings perform across the entire
vocabulary. We call these word prediction plots.</p>
        <p>The idea is to simulate the original word2vec learning objective (namely,
predicting a focus word based on its context words, or vice versa) while
gathering information about elements in the vocabulary. To generate the plot, we
need a test corpus, which we tokenize and disambiguate in the same manner as
during vecsigrafo generation. Next, we iterate the disambiguated test corpus and
for each token, we calculate the cosine similarity between the embedding of the
focus token and the weighted average vector of the context vocabulary elements
in the window. After iterating the test corpus we have, for each vocabulary
element in the test corpus, a list of cosine similarity values for which we can derive
statistics such as the average, standard deviation, minimum and maximum.</p>
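<p>The per-token computation behind these plots can be sketched as follows. This is a simplified version: the context weighting mirrors the dynamic window of Sect. 3, but the data layout and function names are illustrative:</p>

```python
import numpy as np

def prediction_stats(embeddings, test_corpus, window=5):
    """For each vocabulary element, collect cosine similarities between its
    vector and the weighted average of its context vectors, then reduce
    them to summary statistics (the y-axis of a word prediction plot).
    """
    def h(n):  # dynamic context window weighting, as in Sect. 3
        return 1.0 / n if n else 1.0

    sims = {}
    for seq in test_corpus:
        vecs = [embeddings.get(tok) for tok in seq]
        for i, focus in enumerate(seq):
            if vecs[i] is None:
                continue
            ctx, weights = [], []
            for j in range(max(0, i - window), min(len(seq), i + window + 1)):
                if j != i and vecs[j] is not None:
                    ctx.append(vecs[j])
                    weights.append(h(abs(i - j)))
            if not ctx:
                continue
            avg = np.average(ctx, axis=0, weights=weights)
            cos = float(np.dot(vecs[i], avg) /
                        (np.linalg.norm(vecs[i]) * np.linalg.norm(avg)))
            sims.setdefault(focus, []).append(cos)
    # reduce the per-token similarity lists to the plotted statistics
    return {tok: {"mean": float(np.mean(v)), "std": float(np.std(v)),
                  "min": float(np.min(v)), "max": float(np.max(v))}
            for tok, v in sims.items()}
```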
        <p>The plots are configurable, as you can vary: (i) the statistic to display, e.g.
average or minimum; (ii) how to display vocabulary elements (e.g. vary order or
colour); (iii) the version of the generated embeddings; (iv) the test corpora (e.g.
Wikipedia, Europarl); (v) the window size and dynamic context window weighting
scheme (see Sect. 3); (vi) the size of the test corpus: the larger the corpus, the
slower it is to generate the cosine similarities, but the more lexical entries will
be encountered. Plot generation takes, on a typical PC, a couple of minutes for
a corpus of 10K lines and up to half an hour for a corpus of 5M lines.</p>
        <p>These types of plots are useful to verify the quality of embeddings and to
explore hypotheses about the embedding space, as we show in Figure 2. Fig. 2a
shows a plot obtained by generating random embeddings for the English
vocabulary; as can be expected, the average cosine similarity is close to zero for the
150K lexical entries in the vocabulary. Fig. 2b shows a plot for early embeddings
we generated where we had a bug calculating the correlation matrix. Manual
inspection of the embeddings seemed reasonable, but the plot clearly shows that
only the most frequent 5 to 10K vocabulary elements are consistently correct; for
most of the remaining vocabulary elements the predictions are no better than
random (although some of the predictions were good). Fig. 2c shows results for
a recent version of the Spanish vecsigrafo, which are clearly better than random,
although the overall values are rather low. Fig. 2d shows the plot for our current
embeddings, where the vector space is re-centered as explained next.
Fig. 2: Example word prediction plots; in parentheses, the number of sequences
of the test corpus used and the context window size: (a) Random embeddings (2M, 10);
(b) Buggy correlations (5M, 10); (c) Uncentered (5M, 4); (d) Re-centered (10K, 5).
The horizontal axis shows the rank of the vocabulary elements sorted from most
to least frequent; the vertical axis shows the average cosine similarity (which
can range from -1 to 1).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Vector Distribution and Calibration</title>
        <p>One of the things we noticed using the prediction plots was that, even after
fixing bugs with the co-occurrence matrix, there seemed to be a bias against the
most frequent vocabulary elements, as shown in Fig. 2c, where it seems harder to
predict the most frequent words based on their contexts. We formulated various
hypotheses to try to understand why this was happening, which we investigated
by generating further plots. A useful plot in this case was generated by
calculating the average cosine similarity between each vocabulary element and a
thousand randomly generated contexts. If the vector space were well distributed, we
would expect to see a plot similar to Fig. 2a. However, the result depicted in Figure 3a
verifies the suspected bias by showing that, given a random context, the vector
space is more likely to predict an infrequent word rather than a frequent one.</p>
        <p>To avoid this bias, we can recalibrate the embedding space as follows: we
calculate the centroid of all the vocabulary elements and then shift all the
vectors so that the centroid becomes the origin of the space. When generating
the random-contexts plot again using this re-centered embedding space, we get
the expected results. Figures 2c and 2d show that this re-centering also improves
the prediction of the most frequent lexical entries.</p>
        <p>Fig. 3: (a) Original (uncentered); (b) Re-centered.</p>
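<p>The re-centering step itself is a one-line transformation; a sketch, assuming the vocabulary vectors are stacked row-wise in a NumPy array:</p>

```python
import numpy as np

def recenter(vectors):
    """Shift the embedding space so that the centroid of all vocabulary
    vectors becomes the origin, removing the global bias described above.

    `vectors` is an (n_vocab, dim) array; returns the re-centered copy.
    """
    centroid = vectors.mean(axis=0)
    return vectors - centroid
```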
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Vecsigrafo Application: Cross-KG Concept Alignment</title>
      <p>As mentioned in Sect. 2, there are many tasks currently requiring manual effort
where a Vecsigrafo can be useful. In this section we discuss one such application:
cross-KG alignment of concepts. Sensigrafos for different languages have been
modelled by different teams and fit different strategic needs, hence they differ in
terms of maturity and conceptual structure (besides the linguistic differences).
This provides a use case for semantic portability: in order to support
cross-linguality, we need to be able to map concepts between different Sensigrafos as
accurately as possible. We describe how we apply vecsigrafos to accomplish this.</p>
      <sec id="sec-5-1">
        <title>Mapping Vector Spaces</title>
        <p>
          We followed Mikolov et al.'s [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] approach to generate embeddings for different
languages and then align the different spaces.
To do this we need a seed dictionary between the
vocabularies, which we had in the form of a partial
mapping (for 20K concepts) between the Spanish and
English sensigrafos. We expanded the partial concept
mapping to generate a dictionary for lemmas (covering
also around 20K lemmas). We split this dictionary
into training, validation and test sets (80-10-10) and
tried a couple of methods to derive an alignment
between the spaces, summarised in Table 3.
Table 3: Alignment method performance
Method  Nodes  Hit@5
TM      n/a    0.36
NN2     4K     0.61
NN2     5K     0.68
NN2     10K    0.78
NN3     5K     0.72
Although Mikolov suggests using a simple linear translation matrix (TM) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we found
that this method gave very poor results. This suggests the desired alignment between
the embedding spaces is highly non-linear7, which prompted us to use neural
networks (NN) with ReLU activations to capture these non-linearities. The best
NN was able to include the correct translation for a given lexical entry in the
top-5 nearest neighbours in 78% of the cases in our test set. In 90% of the cases we
found that the top-5 suggested translations were indeed semantically close. The
results indicate that the dictionary we used to train the alignment models covers
the low-ambiguity cases, where straightforward ontology alignment methods can
be applied.
7 We explored and verified this in a number of experiments not reported in this paper.
        </p>
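<p>The TM baseline in Table 3 can be sketched as an ordinary least-squares fit. This is a simplified version of Mikolov et al.'s linear mapping, not our NN models; the nearest-neighbour lookup used for the hit@5 check is also shown:</p>

```python
import numpy as np

def fit_translation_matrix(X_src, X_tgt):
    """Least-squares translation matrix W minimising ||X_src W - X_tgt||.

    X_src, X_tgt: (n_pairs, dim) arrays holding the source and target
    embeddings of the seed dictionary entries.
    """
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W

def translate(vec, W, tgt_vectors, k=5):
    """Return the indices of the k nearest target vectors (by cosine) to
    the mapped source vector; a hit@5 check looks for the gold index here."""
    mapped = vec @ W
    sims = (tgt_vectors @ mapped) / (
        np.linalg.norm(tgt_vectors, axis=1) * np.linalg.norm(mapped))
    return np.argsort(-sims)[:k]
```

<p>Replacing the single matrix W with a small feed-forward network with ReLU activations yields the NN2/NN3 variants of the table.</p>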
        <p>Combining embedding and KG information. The results shown in Table 3 are very
encouraging, hence we used them to generate a bilingual vecsigrafo. To check how well
the bilingual vecsigrafo generalises beyond the dictionary, we took a sample of 100
English concepts and analysed their top-5 neighbours in Spanish, as shown in Table 4.
The hit@5 for the in-dictionary concepts is in line with our test results; but for the
out-of-dictionary concepts, we could only manually find an exact synonym in 28% of
the cases. Manual inspection showed that for over half of the concepts, the
corresponding Spanish concept had not been included in the vecsigrafo, or there was
no exact match in the Spanish Sensigrafo. Furthermore, as Table 1 shows, the
Spanish sensigrafo has 75K fewer concepts than the English one and, due to modelling
and language differences, many concepts may be fundamentally unmappable8. In
conclusion: the bilingual vecsigrafo can help us find missing mappings, but still
requires manual validation, as it does not provide a solution to the underlying
problem of finding exactly synonymous concepts.
Table 4: Manual inspection of bilingual embeddings
              in dict   out dict
# concepts    46        64
hit@5         0.72      0.28
no concept    2         33</p>
        <p>
          Our next step was to design a hybrid synonym-concept suggester which
combines features from the bilingual vecsigrafo, the information in the
Sensigrafos and PanLex [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a multilingual thesaurus. Broadly, the suggester
works as follows: for a given concept in the source language, we find the n nearest
concepts in the target language that match the grammar type (i.e. they should
both be nouns, verbs, adjectives, etc.); next, for each candidate, we calculate
a set of hybrid features such as the likelihood of lemma translation, gloss
similarity, absolute and relative cosine similarity, and shared hypernyms and domains. We
then combine the various features into a single score, which we use to re-order
the candidates and decide whether we have found a possible exact synonym.
Finally, we verify whether the suggested synonym candidate is already mapped
to a different concept and, if so, we calculate the same features for this pair and
compare it to the score for the top candidate.
        </p>
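<p>The final scoring step can be sketched as a weighted combination of the candidate features. The feature names and the linear combination below are purely illustrative; the production suggester's exact formula is not described here:</p>

```python
def score_candidates(candidates, weights):
    """Combine per-candidate features into a single score and re-rank.

    `candidates` maps a target-concept id to a feature dict (e.g. cosine
    similarity, gloss similarity, shared hypernyms); `weights` assigns a
    weight per feature name. Returns (concept, features) pairs, best first.
    """
    def score(feats):
        return sum(weights.get(name, 0.0) * value
                   for name, value in feats.items())
    return sorted(candidates.items(),
                  key=lambda kv: score(kv[1]), reverse=True)
```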
      <p>The output of the synonym suggester for an input is either (i) no suggestion or
(ii) a suggested synonym, which can be either clashing or non-clashing. Figure 4a
shows the output suggestions for 1546 English concepts, used in a large
rulebase for IPTC categorization, for which we did not have a Spanish mapping.
We manually inspected 30 of these cases to verify the suggestions. The results,
shown in Fig. 4, confirm that the suggestions are mostly accurate. In fact, for 5 of
the clashing cases, the suggestion was better than the existing mapping; i.e. the
existing mapping was not an exact synonym. For another 4 clashing suggestions,
the existing mapping had very close meanings, indicating the concepts could be
merged. In some cases suggestions were not exact synonyms, but pointed
at modelling differences between the sensigrafos. For example, for the English
concept vote, a verb with gloss go to the polls, the suggested Spanish synonym
was the concept votar, a verb with gloss to express an opinion or preference, for
example in an election or for a referendum, which is more generic than the original
concept. However, the Spanish Sensigrafo does not contain such a specific
concept (among the 5 verb concepts associated with votar) and the English
Sensigrafo does not contain an equivalent concept (among the 6 verb concepts
associated with vote, plus 9 non-verb concepts).
Fig. 4: Synonym suggestions for 1546 syncons and manual inspection breakdown:
(a) 1546 suggestions; (b) 16 no suggestion; (c) 10 clashing; (d) 4 non-clashing.
8 The size and scope of the UN corpus limits the concepts available in Vecsigrafo.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>This paper introduced a method for generating joint word-concept
embeddings based on word sense disambiguation, which can be combined with
knowledge graphs into a Vecsigrafo, providing hybrid capabilities that could
not be achieved by either the KG or word embeddings alone. We presented the
evaluation methods that we have used, introducing a new kind of plot for
assessing embedding spaces.</p>
      <p>Finally, we presented a practical application of Vecsigrafo showing
promising results. We believe these methods can be employed to improve KGs and
tools in the Semantic Web and Computational Linguistics communities.</p>
      <p>
        As future work, we intend to parallelise the presented pipelines for
vecsigrafo and plot generation and apply them to larger corpora. We are also
extending a rule translator [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] using Vecsigrafo to support new language combinations; previously,
translations were only possible between fully-mapped, closely related languages
(e.g. Italian to Spanish). We are designing Vecsigrafo-powered tools that can
be used by our linguists to assist in Sensigrafo curation and alignment tasks.
      </p>
      <p>Acknowledgements This work is supported by CDTI (Spain) as project
IDI-20160805 and by the European Commission under grant 700367 – DANTE –
H2020-FCT-2014-2015/H2020-FCT-2015.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Aroyo</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Welty</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Truth is a lie: Crowd truth and the seven myths of human annotation</article-title>.
          <source>AI Magazine</source>
          <volume>36</volume>(<issue>1</issue>),
          <fpage>15</fpage>–<lpage>24</lpage> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Dinu</surname>, <given-names>G.</given-names></string-name>,
          <string-name><surname>Kruszewski</surname>, <given-names>G.</given-names></string-name>:
          <article-title>Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors</article-title>.
          <source>In: ACL</source> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>Baroni</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Lenci</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Distributional Memory: A General Framework for Corpus-Based Semantics</article-title>.
          <source>Computational Linguistics</source>
          <volume>36</volume>(<issue>4</issue>),
          <fpage>673</fpage>–<lpage>721</lpage> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Bordes</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Usunier</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Weston</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Yakhnenko</surname>, <given-names>O.</given-names></string-name>:
          <article-title>Translating Embeddings for Modeling Multi-Relational Data</article-title>.
          <source>Advances in NIPS 26</source>,
          <fpage>2787</fpage>–<lpage>2795</lpage> (<year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Chen</surname>, <given-names>X.</given-names></string-name>,
          <string-name><surname>Liu</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Sun</surname>, <given-names>M.</given-names></string-name>:
          <article-title>A Unified Model for Word Sense Representation and Disambiguation</article-title>.
          <source>In: EMNLP</source>. pp.
          <fpage>1025</fpage>–<lpage>1035</lpage> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Denaux</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Biosca</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Gomez-Perez</surname>, <given-names>J.M.</given-names></string-name>:
          <article-title>Framework for Supporting Multilingual Resource Development at Expert System</article-title>.
          <source>In: Meta-Forum</source>. Lisbon (<year>2016</year>),
          http://www.meta-net.eu/events/meta-forum-2016/slides/31_denaux.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Duong</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Kanayama</surname>, <given-names>H.</given-names></string-name>,
          <string-name><surname>Ma</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Bird</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Cohn</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Learning Crosslingual Word Embeddings without Bilingual Corpora</article-title>.
          <source>In: EMNLP-2016</source>. pp.
          <fpage>1285</fpage>–<lpage>1295</lpage> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Feng</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Huang</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Yang</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Zhu</surname>, <given-names>X.</given-names></string-name>:
          <article-title>GAKE: Graph Aware Knowledge Embedding</article-title>.
          <source>In: COLING</source>. pp.
          <fpage>641</fpage>–<lpage>651</lpage> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><surname>Gunning</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Chaudhri</surname>, <given-names>V.K.</given-names></string-name>,
          <string-name><surname>Clark</surname>, <given-names>P.E.</given-names></string-name>,
          <string-name><surname>Barker</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Chaw</surname>, <given-names>S.Y.</given-names></string-name>,
          <string-name><surname>Greaves</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Grosof</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Leung</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>McDonald</surname>, <given-names>D.D.</given-names></string-name>,
          <string-name><surname>Mishra</surname>, <given-names>S.</given-names></string-name>,
          et al.:
          <article-title>Project Halo Update – Progress Toward Digital Aristotle</article-title>.
          <source>AI Magazine</source>
          <volume>31</volume>(<issue>3</issue>),
          <fpage>33</fpage>–<lpage>58</lpage> (<year>2010</year>)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name><surname>Iacobacci</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Pilehvar</surname>, <given-names>M.T.</given-names></string-name>,
          <string-name><surname>Navigli</surname>, <given-names>R.</given-names></string-name>:
          <article-title>SENSEMBED: Learning Sense Embeddings for Word and Relational Similarity</article-title>.
          <source>In: 53rd ACL</source>. pp.
          <fpage>95</fpage>–<lpage>105</lpage> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Kamholz</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Pool</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Colowick</surname>, <given-names>S.M.</given-names></string-name>:
          <article-title>PanLex: Building a Resource for Panlingual Lexical Translation</article-title>.
          <source>In: LREC</source>. pp.
          <fpage>3145</fpage>–<lpage>3150</lpage> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Levy</surname>, <given-names>O.</given-names></string-name>,
          <string-name><surname>Goldberg</surname>, <given-names>Y.</given-names></string-name>,
          <string-name><surname>Dagan</surname>, <given-names>I.</given-names></string-name>:
          <article-title>Improving Distributional Similarity with Lessons Learned from Word Embeddings</article-title>.
          <source>Transactions of the ACL</source>
          <volume>3</volume>,
          <fpage>211</fpage>–<lpage>225</lpage> (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name><surname>Mikolov</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Le</surname>, <given-names>Q.V.</given-names></string-name>,
          <string-name><surname>Sutskever</surname>, <given-names>I.</given-names></string-name>:
          <article-title>Exploiting Similarities among Languages for Machine Translation</article-title>.
          <source>Tech. rep., Google Inc</source>. (Sep <year>2013</year>)
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><surname>Pennington</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Socher</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Manning</surname>, <given-names>C.D.</given-names></string-name>:
          <article-title>GloVe: Global vectors for word representation</article-title>.
          <source>In: EMNLP</source>. vol.
          <volume>14</volume>, pp.
          <fpage>1532</fpage>–<lpage>1543</lpage> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name><surname>Ristoski</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Paulheim</surname>, <given-names>H.</given-names></string-name>:
          <article-title>RDF2Vec: RDF graph embeddings for data mining</article-title>.
          <source>In: International Semantic Web Conference</source>. vol.
          <volume>9981</volume> LNCS, pp.
          <fpage>498</fpage>–<lpage>514</lpage> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name><surname>Schnabel</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Labutov</surname>, <given-names>I.</given-names></string-name>,
          <string-name><surname>Mimno</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Joachims</surname>, <given-names>T.</given-names></string-name>:
          <article-title>Evaluation methods for unsupervised word embeddings</article-title>.
          <source>In: EMNLP</source>. pp.
          <fpage>298</fpage>–<lpage>307</lpage>. ACL (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name><surname>Shazeer</surname>, <given-names>N.</given-names></string-name>,
          <string-name><surname>Doherty</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Evans</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Waterson</surname>, <given-names>C.</given-names></string-name>:
          <article-title>Swivel: Improving Embeddings by Noticing What's Missing</article-title>.
          <source>arXiv preprint</source> (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Smilkov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brain</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thorat</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nicholson</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reif</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Viegas</surname>
            ,
            <given-names>F.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wattenberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Embedding Projector: Interactive Visualization and Interpretation of Embeddings</article-title>
          .
          <source>In: Interpretable Machine Learning in Complex Systems</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name><surname>Wang</surname>, <given-names>Z.</given-names></string-name>,
          <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Feng</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chen</surname>, <given-names>Z.</given-names></string-name>:
          <article-title>Knowledge Graph and Text Jointly Embedding</article-title>.
          <source>EMNLP 14</source>,
          <fpage>1591</fpage>–<lpage>1601</lpage> (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name><surname>Ziemski</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Junczys-Dowmunt</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Pouliquen</surname>, <given-names>B.</given-names></string-name>:
          <article-title>The United Nations Parallel Corpus v1.0</article-title>.
          <source>In: LREC</source> (<year>2016</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>