Hyperbolic Embedding for Finding Syntax in BERT

Temirlan Auyespek¹, Thomas Mach² and Zhenisbek Assylbekov¹
¹ Nazarbayev University, 53 Kabanbay Batyr Ave, Nur-Sultan, Kazakhstan
² University of Potsdam, Karl-Liebknecht-Str. 24–25, Potsdam, Germany

Abstract
Recent advances in natural language processing have improved our understanding of what kind of linguistic knowledge is encoded in modern word representations. For example, Hewitt and Manning (2019) developed a method for testing whether syntax trees can be extracted from a language model architecture: they project word vectors into a Euclidean subspace in such a way that the squared Euclidean distance between projected vectors approximates the tree distance between the corresponding words in the syntax tree. This work proposes a method for assessing whether embedding word representations in hyperbolic space can better reflect the graph structure of syntax trees. We show that the tree distance between words in a syntax tree can be approximated well by the hyperbolic distance between the corresponding word vectors.

Keywords
BERT, Poincaré ball, Structural probe

AIxIA 2021 Discussion Papers, December 1–3, 2021, Milan, Italy
temirlan.auyespek@nu.edu.kz (T. Auyespek); mach@uni-potsdam.de (T. Mach); zhassylbekov@nu.edu.kz (Z. Assylbekov)
https://github.com/TemirlanAuyespek (T. Auyespek)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Recent advances in natural language processing (NLP), such as contextualized word embeddings obtained from language models [1], have led to significant improvements on natural language understanding tasks. It is therefore important to understand what kind of linguistic knowledge is encoded in these representations. Several works explore specific types of linguistic knowledge, such as part-of-speech [2], morphology [3, 4], and syntax [5, 6, 7].

On the one hand, this paper is inspired by [5] and [7], who proposed methods for recovering syntactic dependencies under the squared Euclidean distance and the squared Poincaré distance, respectively. In this work we propose methods for extracting syntactic dependencies under the Poincaré distance without squaring. On the other hand, the paper is motivated by the observation that one cannot draw a tree in Euclidean space with unit distance between all neighboring nodes and without overlap, since there is not enough room for the nodes; see Fig. 1 for a visualization of the problem. Mathematically, one can argue that the number of nodes in a binary tree grows faster than the Euclidean volume as the tree depth grows, i.e. there is always a $k_0$ such that $2^k > (2k)^d$ for all $k > k_0$, where $d$ is the dimension of the Euclidean space.

Figure 1: Trees cannot be drawn in Euclidean space with constant distances between neighboring nodes: beyond a certain depth, node placements conflict.
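The counting argument above is easy to verify numerically. The following is a minimal Python sketch (our illustration, not part of the original experiments) that finds the depth at which the node count of a binary tree overtakes the available Euclidean volume:

```python
# Numerical check of the counting argument: a binary tree of depth k has on
# the order of 2^k nodes, while a unit-edge drawing must fit inside a ball
# of graph radius k, whose volume scales like (2k)^d in d dimensions.
def smallest_k0(d: int) -> int:
    """First depth k at which 2^k exceeds (2k)^d; since 2^k grows faster
    than any polynomial in k, the inequality keeps holding afterwards."""
    k = 1
    while 2**k <= (2 * k) ** d:
        k += 1
    return k

for d in (2, 3, 10):
    print(f"d = {d:2d}: 2^k > (2k)^d for all k >= {smallest_k0(d)}")
# Prints k0 = 9, 15, 72 for d = 2, 3, 10: the tree always wins eventually.
```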
2. Related work

The structural probe [5] was proposed for extracting syntactic knowledge from word representations. The probe identifies a linear transformation under which squared Euclidean distances between word vectors reproduce the distances between words in the parse tree. A second method, the Poincaré probe [7], projects word representations into a Poincaré subspace in order to reveal linguistic hierarchies encoded in BERT.

These results suggest that the linguistic information contained in BERT may be encoded in metric spaces that are not necessarily Euclidean. A hyperbolic space model, in particular the Poincaré ball, is a good candidate due to its tree-likeness [8, 9].

3. Methods

We start by briefly introducing hyperbolic geometry, following the notation of [10]. Hyperbolic geometry is a geometry with constant negative curvature. There are five isometric models [11]; we choose the Poincaré ball, as in [7]. The Poincaré ball with curvature $-1$ is defined as $\mathbb{D}^n = \{\mathbf{x} \in \mathbb{R}^n \mid \|\mathbf{x}\|^2 < 1\}$. The distance between two points $\mathbf{u}, \mathbf{v} \in \mathbb{D}^n$ is given by

$$d_{\mathbb{D}}(\mathbf{u}, \mathbf{v}) = \cosh^{-1}\left(1 + 2\,\frac{\|\mathbf{u} - \mathbf{v}\|^2}{(1 - \|\mathbf{u}\|^2)(1 - \|\mathbf{v}\|^2)}\right).$$

For projecting points into the Poincaré ball we consider two mappings, the gnomonic mapping and the hyperboloid mapping, denoted by $g(\cdot)$ and $h(\cdot)$, respectively. Their closed-form expressions are

$$g(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{1 + \|\mathbf{x}\|^2}}, \qquad h(\mathbf{x}) = \frac{\mathbf{x}}{1 + \sqrt{1 + \|\mathbf{x}\|^2}}.$$

Both map points from Euclidean space into the unit ball, which is regarded as the Poincaré ball. Additionally, we use the Möbius matrix-vector multiplication defined as

$$\mathbf{M} \otimes \mathbf{x} = \tanh\left(\frac{\|\mathbf{M}\mathbf{x}\|}{\|\mathbf{x}\|}\,\tanh^{-1}(\|\mathbf{x}\|)\right) \frac{\mathbf{M}\mathbf{x}}{\|\mathbf{M}\mathbf{x}\|},$$

which is the hyperbolic analogue of a Euclidean linear transformation.

Our method consists of three steps, as in [7], but with different ways of mapping into the Poincaré ball. The method is applied to word representations $\mathbf{h}_{1:t}$ obtained from one of BERT's layers [1] for a sentence $w_{1:t}$ consisting of words $[w_1, \ldots, w_t] =: w_{1:t}$. The first step applies a linear transformation $\mathbf{B}\colon \mathbb{R}^n \to \mathbb{R}^m$, where $n$ is the dimension of the word representations and $m$ is the embedding dimension, yielding vectors $\mathbf{x}_i = \mathbf{B}\mathbf{h}_i$. The second step applies the gnomonic or hyperboloid mapping to obtain vector representations in the Poincaré ball, $\mathbf{y}_i = g(\mathbf{x}_i)$ or $\mathbf{y}_i = h(\mathbf{x}_i)$. The final step applies the Möbius matrix-vector multiplication $\mathbf{M}\colon \mathbb{D}^m \to \mathbb{D}^m$ to obtain the final vector representations $\mathbf{z}_i = \mathbf{M} \otimes \mathbf{y}_i$.

The matrices $\mathbf{B}$ and $\mathbf{M}$ are trained in such a way that the hyperbolic distance $d_{\mathbb{D}}(\mathbf{z}_i, \mathbf{z}_j)$ resembles the graph distance $d_T(w_i, w_j)$ between $w_i$ and $w_j$ in the syntax tree. The training objective for one sentence is

$$\ell(w_{1:t}; \mathbf{B}, \mathbf{M}) := \frac{1}{t^2} \sum_{i,j} \left| d_T(w_i, w_j) - d_{\mathbb{D}}(\mathbf{z}_i, \mathbf{z}_j) \right|. \qquad (1)$$

Our approach is illustrated in Fig. 2.

Figure 2: Illustration of our method.
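For concreteness, the three-step pipeline and objective (1) can be written out in a few lines of numpy. This is a minimal sketch under the notation above, not the exact code used in the experiments; in particular, the training loop that fits B and M is omitted:

```python
import numpy as np

def gnomonic(x):
    """Gnomonic mapping g(x) = x / sqrt(1 + ||x||^2); lands in the unit ball."""
    return x / np.sqrt(1.0 + x @ x)

def hyperboloid(x):
    """Hyperboloid mapping h(x) = x / (1 + sqrt(1 + ||x||^2))."""
    return x / (1.0 + np.sqrt(1.0 + x @ x))

def mobius_matvec(M, y, eps=1e-9):
    """Möbius matrix-vector product M ⊗ y for y inside the Poincaré ball."""
    My = M @ y
    y_norm, My_norm = np.linalg.norm(y), np.linalg.norm(My)
    if y_norm < eps or My_norm < eps:
        return np.zeros_like(My)
    return np.tanh(My_norm / y_norm * np.arctanh(y_norm)) * My / My_norm

def poincare_dist(u, v):
    """d_D(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))."""
    num = 2.0 * (u - v) @ (u - v)
    den = (1.0 - u @ u) * (1.0 - v @ v)
    return np.arccosh(1.0 + num / den)

def probe_distances(H, B, M, mapping=gnomonic):
    """H: (t, n) BERT vectors for one sentence; B: (m, n); M: (m, m).
    Returns the (t, t) matrix of hyperbolic distances d_D(z_i, z_j)."""
    Z = [mobius_matvec(M, mapping(B @ h)) for h in H]
    return np.array([[poincare_dist(zi, zj) for zj in Z] for zi in Z])

def sentence_loss(tree_dist, pred_dist):
    """Objective (1): mean absolute gap between tree and hyperbolic distances."""
    return np.abs(tree_dist - pred_dist).sum() / tree_dist.shape[0] ** 2
```

In the actual experiments, B and M are trained with Adam to minimize this loss averaged over a corpus (Section 4.1).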
4. Experiments

The main purpose of the experiments is to show that the usual (non-squared) Poincaré distances can encode tree distances.¹

¹ The results can be reproduced at https://github.com/TemirlanAuyespek/HyperbolicEmbedding

4.1. Setup

The training objective (1) is averaged over a set of sentences (a corpus) and minimized with respect to B and M in the same way as in [7]. We use the Adam optimizer [12] with a learning rate of 0.001, and we optimize (1) on the English Universal Dependencies dataset [13]. To evaluate performance we report the undirected unlabeled attachment score (UUAS) and the average Spearman correlation (DSpr.). UUAS is the percentage of undirected edges placed correctly with respect to the gold syntax tree, and DSpr. is the Spearman correlation between true and predicted distances, averaged over the words of each sentence (a code sketch of both metrics is given at the end of Section 4.2).

4.2. Results

Results for different embedding dimensionalities m and for different layers of BERT are given in Fig. 3, where for comparison we also show the results of the Poincaré probe from [7] trained without squaring. Table 1 shows the best result for each method.

Figure 3: Probing results with respect to embedding dimensionality (left) and with respect to BERT's layer index (right).

Table 1: Best results for each method.

        Structural probe   Gnomonic mapping   Hyperboloid mapping   Exponential mapping
UUAS    79.17              78.51              78.51                 78.82
DSpr.   80.94              83.79              83.93                 83.96

The newly proposed methods with gnomonic and hyperboloid mappings give results competitive with the state of the art. They outperform the structural probe in lower dimensions; however, for dimensions above 16 the results become more and more similar. The exponential mapping from [7] without squaring yields scores similar to the gnomonic and hyperboloid mappings, but with a dip at embedding dimensions 16 and 32. Fig. 3 (right) shows UUAS for different BERT layers at embedding dimension 128, which we consider a good trade-off between performance and computational complexity.
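The two metrics can be computed as sketched below. This is an illustrative sketch rather than the exact evaluation code: it assumes the gold tree is given as a set of undirected edges and follows the standard protocol of [5], in which the predicted tree is taken to be the minimum spanning tree of the predicted distance matrix:

```python
# Sketch of the evaluation metrics, assuming scipy is available, the gold
# tree is a set of undirected (i, j) edges, and distances are dense matrices.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

def uuas(pred_dist, gold_edges):
    """UUAS: fraction of gold undirected edges recovered by the minimum
    spanning tree of the predicted distance matrix (protocol of [5])."""
    mst = minimum_spanning_tree(pred_dist).tocoo()
    pred = {(int(min(i, j)), int(max(i, j))) for i, j in zip(mst.row, mst.col)}
    gold = {(min(i, j), max(i, j)) for i, j in gold_edges}
    return len(pred & gold) / len(gold)

def dspr(pred_dist, gold_dist):
    """DSpr.: Spearman correlation between predicted and true distances,
    computed per word (row of the matrix) and then averaged."""
    rows = range(gold_dist.shape[0])
    rhos = [spearmanr(pred_dist[i], gold_dist[i]).correlation for i in rows]
    return float(np.mean(rhos))
```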
4.3. Visualization

We visualize the recovered dependency trees in Fig. 4, using a PCA projection as in [7]. The trees recovered by the two methods have a similar structure, which is also very close to the original syntax tree. The projection can introduce a certain level of distortion, since there is no good analogue of PCA in hyperbolic space [14]. Along with the competitive performance of our method, an important contribution of our work is the interpretability of the obtained Poincaré ball distances: they themselves, and not their squares, approximate the syntax tree distances.

Figure 4: Examples of trees recovered by the new methods: (a, d) true syntax trees; (b, e) gnomonic mapping; (c, f) hyperboloid mapping. Yellow geodesics show the ground truth; blue dashed geodesics show predicted dependencies, and the hyperbolic distances along them approximate the syntax tree distances. Background gray curves are geodesics of the Poincaré ball.

5. Conclusion

In this work we introduced two methods for embedding word representations in a model of hyperbolic space, specifically the Poincaré ball. These methods are able to recover syntactic knowledge from the word representation space. The obtained results are comparable to those of Chen et al. [7] and are sometimes better. More importantly, we showed that hyperbolic distances can encode tree distances without any squaring. These results also confirm that hyperbolic spaces fit tree-structured data better than Euclidean spaces. Future research will investigate other configurations of the BERT model, such as BERT Large, and the extraction of other kinds of linguistic knowledge.

Acknowledgements

This work is supported by the Nazarbayev University faculty-development competitive research grants program, grant number 240919FD3921. The authors would like to thank the anonymous reviewers for their feedback.

References

[1] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL-HLT, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[2] Y. Belinkov, N. Durrani, F. Dalvi, H. Sajjad, J. R. Glass, What do neural machine translation models learn about morphology?, in: Proceedings of ACL, 2017, pp. 861–872. doi:10.18653/v1/P17-1080.
[3] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of NAACL-HLT, 2018, pp. 2227–2237. doi:10.18653/v1/n18-1202.
[4] M. E. Peters, M. Neumann, L. Zettlemoyer, W.-t. Yih, Dissecting contextual word embeddings: Architecture and representation, in: Proceedings of EMNLP, 2018, pp. 1499–1509. doi:10.18653/v1/d18-1179.
[5] J. Hewitt, C. D. Manning, A structural probe for finding syntax in word representations, in: Proceedings of NAACL-HLT, 2019, pp. 4129–4138. doi:10.18653/v1/n19-1419.
[6] T. Linzen, E. Dupoux, Y. Goldberg, Assessing the ability of LSTMs to learn syntax-sensitive dependencies, Transactions of the Association for Computational Linguistics 4 (2016) 521–535.
[7] B. Chen, Y. Fu, G. Xu, P. Xie, C. Tan, M. Chen, L. Jing, Probing BERT in hyperbolic spaces, CoRR abs/2104.03869 (2021). URL: https://arxiv.org/abs/2104.03869.
[8] M. Nickel, D. Kiela, Poincaré embeddings for learning hierarchical representations, in: Proceedings of NeurIPS, 2017, pp. 6338–6347.
[9] R. Sarkar, Low distortion Delaunay embedding of trees in hyperbolic plane, in: Graph Drawing, 2011.
[10] O. Ganea, G. Bécigneul, T. Hofmann, Hyperbolic neural networks, in: Proceedings of NeurIPS, 2018, pp. 5350–5360.
[11] J. W. Cannon, W. J. Floyd, R. Kenyon, W. R. Parry, Hyperbolic geometry, Flavors of Geometry 31 (1997) 59–115.
[12] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proceedings of ICLR, 2015. URL: http://arxiv.org/abs/1412.6980.
[13] N. Silveira, T. Dozat, M. de Marneffe, S. R. Bowman, M. Connor, J. Bauer, C. D. Manning, A gold standard dependency corpus for English, in: Proceedings of LREC, 2014, pp. 2897–2904.
[14] X. Pennec, Barycentric subspace analysis on manifolds, Annals of Statistics 46 (2018) 2711–2746.