Probing a Semantic Dependency Parser for Translational Relation Embeddings

Riley Capshaw, Marco Kuhlmann, and Eva Blomqvist
Linköping University, Linköping, Sweden
{riley.capshaw, marco.kuhlmann, eva.blomqvist}@liu.se

Abstract. Translational relation models are primarily applied to the task of Knowledge Graph embedding. We present a structural probe for testing whether a state-of-the-art semantic dependency parser learns contextualized word representations which fit a translational relation model. We find that the parser does not explicitly learn a translational relation model. We do, however, find that a simple transformation of the word representations is enough to induce a TransE model with 73.45% label recall, indicating that translational relation models are at least implicitly learned by the parser. We believe that our findings can in the future be used to develop new Natural Language Understanding systems that are more useful for Knowledge Graph generation and completion.

Keywords: Knowledge Graph Embedding · Deep Learning · Natural Language Understanding · Semantic Dependency Parsing

1 Introduction

A Knowledge Graph (KG) represents entities as nodes and relations as labeled, directed edges (s, p, o), indicating that a specific predicate p holds between the subject s and the object o. A knowledge graph embedding projects a KG into a high-dimensional vector space, often with geometric constraints enforced on the learned representations. For example, TransE [4] maps s and o to n-dimensional vectors h, t ∈ R^n and maps p to a translation vector r such that h + r ≈ t whenever (s, p, o) holds. This type of geometric representation makes models highly interpretable, because it is easy to explain why a relation has been identified. Such interpretability is a desirable feature of models in many domains. For example, geometric interpretations of word embeddings have helped to identify and handle gender bias in language models [3]. In this paper we use a structural probe to examine whether Natural Language Processing (NLP) models for relation prediction explicitly represent relations as translation operations. Our experiments focus on the task of semantic dependency parsing due to its structural and conceptual similarities with KG completion from natural language. We argue that, in the future, in addition to studying the internal representation of an NLP component, such a probe can be used to identify steps in NLP pipelines that can be harmonized with models used in downstream KG completion systems, which has been shown to enhance the usefulness of the NLP concepts produced by those steps [16].

This paper is structured as follows. Section 2 discusses the necessary background information. Section 3 describes the experimental setup for probing a semantic dependency parser. Section 4 presents and analyzes the probing results. Finally, we discuss our conclusions and their implications for future work in Section 5.

2 Background and Related Work

The Semantic Web community has a long tradition of combining methods from NLP and Machine Learning to support various tasks related to ontologies, Linked Data, and lately the broader notion of KGs. For instance, early work on formalizing natural language to automate ontology learning relied heavily on classical NLP pipelines (e.g. FRED [5]), while more recent work has also applied Deep Learning to similar tasks [2]. For a comprehensive introduction to KG creation from text (e.g. using NLP and Information Extraction techniques), we refer to the recent tutorial paper by Hogan et al. [8]. In this section we specifically target KG embeddings, then introduce semantic dependency parsing, and finally introduce the concept of probing neural networks, to set the stage for our experiments.

Knowledge Graph Embedding techniques have been developed to encode KGs into continuous vector spaces. KG embedding models which consider only entities and relations can be broadly categorized as either translational models (a.k.a. translational distance models) or semantic matching models [15]. Semantic matching models (e.g. RESCAL [11]) are less relevant for this study because they do not enforce geometric interpretations of relation embeddings. Translational models primarily embed relations as vectors corresponding to translation operations. TransE [4] is the simplest such model, though it has difficulty modeling relationships which are not 1-to-1. Improvements upon TransE model more complex relationships as translation operations by contextualizing entity representations per relation (e.g. TransR [10]) or by relaxing constraints (e.g. ManifoldE [17]).
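To make the translational constraint concrete, the following toy sketch scores a triple by the distance between h + r and t. It is illustrative only: the entities, relation, dimensionality, and random vectors are made-up assumptions, not a published implementation; training a model such as TransE amounts to pushing this distance down for observed triples and up for corrupted ones.

```python
# Toy illustration of the TransE constraint h + r ≈ t (not the original implementation).
# All entities, relations, vectors, and the dimensionality below are made-up examples.
import numpy as np

rng = np.random.default_rng(0)
dim = 4                                               # embedding dimensionality (assumed)
entity = {"Stockholm": rng.normal(size=dim), "Sweden": rng.normal(size=dim)}
relation = {"capitalOf": rng.normal(size=dim)}

def transe_score(s, p, o):
    """Lower is better: distance between (head + relation) and tail."""
    return np.linalg.norm(entity[s] + relation[p] - entity[o])

# Training would adjust the vectors so that observed triples such as
# (Stockholm, capitalOf, Sweden) score low while corrupted triples score high.
print(transe_score("Stockholm", "capitalOf", "Sweden"))
```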
Semantic Dependency Parsing is a task within NLP in which individual sentences are annotated with binary dependency relations that encode shallow semantic phenomena between words [13]. While conceptually similar to semantic role labeling [6], semantic dependency parsing instead covers all content words, forming a semantic dependency graph (SDG). A semantic dependency parser is a system that produces an SDG for a sentence annotated with linguistic features. Most parsers internally generate contextualized vector representations for each word, but to our knowledge none provide easily recoverable relation representations, nor had their relation to translational models been studied prior to this work.

For this study we use the data from two SemEval shared tasks on semantic dependency parsing [13,14], where sentences from the Penn Treebank were annotated with three target semantic dependency representations. We only use the English DELPH-IN MRS (DM) [12] subset, as it has been released publicly.¹

¹ Available at: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1956

Fig. 1. Example SDG #22007003 from the DM dataset: the graph for the sentence "The results were in line with analysts' expectations", with a marked top node and dependency edges labeled ARG1, ARG2, BV, and poss.

Figure 1 shows a DM SDG. Note how it focuses on argument relationships (e.g. ARG1 and ARG2), such as 'results' and 'line' being arguments of the predicate 'in', and how semantically vacuous words like 'were' are left disconnected. For this example, one triple in the resulting SDG would be (in, ARG2, line).

Neural Network Probes were presented by Alain and Bengio [1] as a way of analyzing the intermediate layers of a deep neural network without influencing its training. They argue that convex classifiers (e.g., softmax classifiers trained by minimizing cross-entropy) can be used to approximate an upper bound on the linear separability of features. Hence, we use linear probes to bound our experiments and to provide a point of comparison, since there is no directly comparable related work. Our experiments instead use structural probes, which analyze the structure of a learned vector space. For example, Hewitt and Manning [7] probe language models for syntactic structure by using ℓ2 distance as an indication of parse tree distance. This differs from our work in that we also consider the direction of difference vectors to be informative.
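Concretely, the distance probe of Hewitt and Manning [7] can be summarized, roughly and in our own notation, as learning a matrix B that makes squared transformed distances mimic parse tree distances:

  min_B Σ_s (1/|s|²) Σ_{i,j} | d_T(w_i, w_j) − ‖B(h_i − h_j)‖₂² |,

where d_T(w_i, w_j) is the distance between words w_i and w_j in the parse tree of sentence s and h_i, h_j are their contextualized vectors. Only the magnitude of the transformed difference vector enters that objective, whereas the probes defined in Section 3 also constrain its direction.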
3 Experimental Setup

We probe the semantic dependency parser developed by Kurtz et al. [9]. The parser first contextualizes every word of an input sentence using three layers of bidirectional long short-term memory (BiLSTM) units. It then scores edges as h_i^⊤ U^[r] d_j, where U is a rank-three tensor such that h_i^⊤ U^[r] d_j is the scalar score for relation r from word i (specialized as the head) to word j (specialized as the tail, or dependent). We use the parser to predict the SDG for every sentence in the training and testing sets of the SemEval shared task DM data. For every predicted dependency r between a pair of words h and t, we record the respective m-dimensional representations h, t ∈ R^m generated by each BiLSTM layer as the triple (h, r, t). The resulting training set has 600,871 triples with 50 distinct relations, while the testing set has 24,719 triples with only 39 distinct relations.

To establish an upper bound for recall using only h and t, we formulate a simple linear classifier probe as in Alain and Bengio [1]. Given x as a combination of h and t, we train softmax(Wx + b) by minimizing cross-entropy, where W and b are the weights and biases to learn. By setting x = t − h, we can directly evaluate the translation vector. We contrast this with concatenation (x = [t; h]), which preserves all input features but enlarges W, and with addition (x = h + t).

The structural probes are formulated nearly identically to translational embedding models. We first define a linear transformation matrix M ∈ R^{m×m} to be learned. Then, for every triple (h, r, t) we normalize Mh and Mt to unit length and calculate r = Mt − Mh, where r is the vector representing a particular instance of r as a translation operation. Since a perfect model would result in r being equal for all triples with the same r, we also define a representative vector r_r for each r such that r_r − r ≈ 0. We then approximate all r_r and M through gradient descent by minimizing the scoring function

  f_r(h, t) = ‖Mh + r_r − Mt‖₂²,    (1)

where ‖·‖₂² is the squared ℓ2 norm. We do not enforce strict translation; instead we define a margin γ such that f_r(h, t) ≤ γ. Since a given h and t can have only one relationship, we enforce f_{r′}(h, t) > γ for all r′ ≠ r to maximize the separation of the regions corresponding to the relations. We also encourage all r for a given relation to have roughly the same direction with the constraint ‖r‖₂² > 2γ. Incorporating these constraints gives us the following margin-based loss function:

  L = [f_r(h, t) − γ]₊ + Σ_{r′ ≠ r} [γ − f_{r′}(h, t)]₊ + [2γ − ‖r‖₂²]₊,    (2)

where [·]₊ is equivalent to max(0, ·). To see whether a translational model is explicitly learned by the parser, we also define a probe in which the input vectors are not transformed: r = t − h. Finally, to test whether the classifiers actually fit a translational model, we recalculated both structural probes' scores without retraining, counting a prediction as correct only if the constraint f_r(h, t) ≤ γ was satisfied.

For all experiments, we set γ = 0.25 and trained for 100 epochs. All scores are reported as micro-averaged recall. Since we are only interested in the probes' ability to reconstruct the parser's relation predictions, we do not predict the absence of a relation between words. If we did, incorrect predictions would only affect precision, correct predictions would affect neither precision nor recall, and micro-averaged recall would no longer be equal to label accuracy.
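For concreteness, the following minimal sketch shows how the objective in Eqs. (1) and (2) could be implemented. It is not the implementation used for our experiments: the framework (PyTorch), the dimensionality m, the initialization, and the optimizer are assumptions made for illustration; only γ = 0.25 and the number of relations (50) come from the setup above.

```python
# Minimal sketch of the structural probe of Eqs. (1)-(2); illustrative, not the code
# used for the experiments. Dimensionality, initialization, and optimizer are assumptions.
import torch
import torch.nn.functional as F

m, n_relations, gamma = 600, 50, 0.25                 # m is assumed; 50 relations and gamma as above
M = torch.nn.Parameter(torch.eye(m))                  # linear transformation M to be learned
R = torch.nn.Parameter(0.01 * torch.randn(n_relations, m))  # representative vectors r_r
opt = torch.optim.Adam([M, R])

def f_all(h, t):
    """f_r(h, t) for every relation r (Eq. 1), plus the instance vector r = Mt - Mh."""
    mh = F.normalize(h @ M.T, dim=-1)                 # normalize Mh and Mt to unit length
    mt = F.normalize(t @ M.T, dim=-1)
    scores = ((mh.unsqueeze(1) + R.unsqueeze(0) - mt.unsqueeze(1)) ** 2).sum(-1)
    return scores, mt - mh                            # scores: (batch, n_relations)

def loss_fn(h, t, r_idx):
    """Margin-based loss of Eq. (2); r_idx is a LongTensor of gold relation indices."""
    scores, diff = f_all(h, t)
    gold = scores.gather(1, r_idx.unsqueeze(1)).squeeze(1)
    pull = torch.clamp(gold - gamma, min=0)                       # [f_r - gamma]_+
    push = (torch.clamp(gamma - scores, min=0).sum(1)
            - torch.clamp(gamma - gold, min=0))                   # sum over r' != r of [gamma - f_r']_+
    spread = torch.clamp(2 * gamma - (diff ** 2).sum(-1), min=0)  # [2*gamma - ||r||^2]_+
    return (pull + push + spread).mean()

# One illustrative optimization step on a batch of recorded triples:
#   opt.zero_grad(); loss_fn(h_batch, t_batch, r_batch).backward(); opt.step()
```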
4 Results and Analysis

This section presents and analyzes the results of the experiments outlined in Section 3. Table 1 presents all recall scores broken down by probe type, where 'Constrained' refers to the structural probes with the margin constraint strictly enforced at prediction time. As a baseline for comparison, always guessing the most frequent relation (ARG1) scores 37.32%. Most probes outperform this baseline, and the scores increase consistently with layer depth for all but one probe.

Table 1. Scores for all probing experiments in terms of recall as a function of layer.

Category     ID  Probe           Layer 0  Layer 1  Layer 2  Layer 3
Linear       L1  W[h + t] + b      66.73    81.34    88.85    91.01
             L2  W[t − h] + b      67.82    85.18    94.07    95.89
             L3  W[h; t] + b       72.69    89.06    96.54    97.52
Structural   S1  h + r_r − t       38.87    48.44    56.73    60.09
             S2  Mh + r_r − Mt     60.37    76.88    86.96    90.76
Constrained  C1  h + r_r − t        0.82     0.89     0.84     0.04
             C2  Mh + r_r − Mt     35.62    56.37    65.10    73.57

For the linear probes (L1 to L3), the highest overall recall is achieved by the concatenation probe (L3), likely because it preserves the most input information. It outperforms the difference probe (L2) by 1.73 points, which in turn outperforms the addition probe (L1) by 5.01 points. This indicates that the translation vectors already encode much of the information necessary for classification.

For the structural probes, probe S1 (which does not transform the input vectors) still outperformed the baseline, yet almost none of its triples satisfied the margin constraint (C1). This shows that it did not actually fit a translational model. Instead, it likely defaulted to a linear partitioning of the vector space akin to a multiclass perceptron. Probe S2 performed on par with L1 and still performed well when the margin was enforced at prediction time (C2), indicating that it did learn a translational model.

We also analyzed the effect of relaxing γ at prediction time by increasing it to a value γ′ at which all triples satisfied f_r(h, t) ≤ γ′. For probe C1 this occurred at γ′ = 5γ = 1.25, while for probe C2 it occurred much sooner, at γ′ = 2.4γ = 0.6. This indicates that the margin may need adjustment at prediction time. Recall that if the margin was satisfied for some part of the loss, then the loss for that part was zero.

When analyzing the class distribution of the probes' predictions, we noticed that all three linear probes completely missed five relations, indicating an inability to handle imbalanced classes. In contrast, probes S2 and C2 missed two relations, and probes S1 and C1 missed only one. Despite having lower overall recall, both structural probes seem to capture rare relations better.
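For clarity, the constrained scoring used for rows C1 and C2 can be sketched as follows, reusing the hypothetical f_all helper from the sketch in Section 3. This is again an illustration under the same assumptions, not the code behind the reported numbers: a prediction counts as correct only when the best-scoring relation also satisfies the margin.

```python
# Sketch of the constrained evaluation (rows C1/C2): predict the relation with the
# lowest score and count it as correct only if it also satisfies f_r(h, t) <= gamma.
# Reuses f_all from the Section 3 sketch; illustrative only.
def constrained_recall(h, t, r_idx, gamma=0.25):
    scores, _ = f_all(h, t)                    # (batch, n_relations)
    min_scores, pred = scores.min(dim=1)       # best (lowest) score and its relation index
    correct = (pred == r_idx) & (min_scores <= gamma)
    return correct.float().mean().item()       # micro-averaged recall over gold triples
```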
5 Conclusions and Future Work

In this work, we explored the relationship between NLP and KGs by analyzing the structure of the vector space learned by a semantic dependency parser. Applying KG techniques to NLP tasks is an attempt to further harmonize the two areas, allowing continuous representations of NLP concepts to be more naturally incorporated into deep-learning-based KG generation pipelines. Such representations capture information important for label predictions which a KG generation system would be unable to learn on its own. To carry out this exploration, we presented two structural probes inspired by translational embedding models.

We found that the semantic dependency parser we probed does not explicitly learn a translational model. However, a single linear transformation matrix is sufficient to fit such a model to the parser's contextualized word representations with 73.45% recall, implying that the parser is implicitly learning a translational model. A promising path forward is then to implement a decoder for a semantic dependency parser based on translational relation models to yield explicit, interpretable relation vectors alongside the full SDG. These can then be used in a KG completion system based on translational embeddings (of which there are many [15]), or in a full Natural Language Understanding pipeline which jointly trains all components, from NLP subsystems to the KG generation itself.

Acknowledgments

This research work was funded in part by CUGS (the National Graduate School in Computer Science, Sweden).

References

1. Alain, G., Bengio, Y.: Understanding intermediate layers using linear classifier probes. In: Proc. of ICLR. OpenReview.net (2017)
2. Arguello Casteleiro, M., et al.: Deep learning meets ontologies: experiments to anchor the cardiovascular disease ontology in the biomedical literature. Journal of Biomedical Semantics 9(1) (2018)
3. Bolukbasi, T., et al.: Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In: Proc. of NeurIPS (2016)
4. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Proc. of NeurIPS (2013)
5. Gangemi, A., et al.: Semantic web machine reading with FRED. Semantic Web 8(6) (2017)
6. Gildea, D., Jurafsky, D.: Automatic labeling of semantic roles. Computational Linguistics 28(3) (2002)
7. Hewitt, J., Manning, C.D.: A structural probe for finding syntax in word representations. In: Proc. of NAACL. ACL (2019)
8. Hogan, A., et al.: Knowledge graphs (2020), https://arxiv.org/abs/2003.02320
9. Kurtz, R., Roxbo, D., Kuhlmann, M.: Improving semantic dependency parsing with syntactic features. In: Proc. of the First NLPL Workshop on Deep Learning for Natural Language Processing. Linköping University Electronic Press (2019)
10. Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: Proc. of AAAI. AAAI Press (2015)
11. Nickel, M., Tresp, V., Kriegel, H.P.: A three-way model for collective learning on multi-relational data. In: Proc. of ICML. Omnipress (2011)
12. Oepen, S., Lønning, J.T.: Discriminant-based MRS banking. In: Proc. of LREC (2006)
13. Oepen, S., et al.: SemEval 2014 task 8: Broad-coverage semantic dependency parsing. In: Proc. of the 8th Int. Workshop on Semantic Evaluation. ACL (2014)
14. Oepen, S., et al.: SemEval 2015 task 18: Broad-coverage semantic dependency parsing. In: Proc. of the 9th Int. Workshop on Semantic Evaluation. ACL (2015)
15. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE 29(12) (2017)
16. Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph and text jointly embedding. In: Proc. of EMNLP. ACL (2014)
17. Xiao, H., Huang, M., Zhu, X.: From one point to a manifold: Knowledge graph embedding for precise link prediction. In: Proc. of IJCAI. IJCAI/AAAI Press (2016)