Boosting Dependency Parsing Performance by
Incorporating Additional Features for Agglutinative
Languages
Mücahit Altıntaş 1,2, A. Cüneyd Tantuğ 1
1 Faculty of Computer Science, Natural Language Processing and Social Robotic Lab, Istanbul Technical University, 34469, Maslak, Istanbul, Turkey
2 Faculty of Engineering, Bayburt University, 69002, Bayburt, Turkey


           Abstract
           In recent studies, the use of language models has increased noticeably and has yielded considerable
           improvements. However, choosing the proper representation and taking complementary components
           into account are still open issues. In this research, the impact of sub-word level, sentence piece based
           word representation on dependency parsing performance is demonstrated for agglutinative languages.
           Furthermore, we propose to use a sentence representation that holds the whole meaning of the sentence
           as an additional feature to improve dependency parsing. Our proposed enhancements are evaluated on
           nine agglutinative languages: Estonian, Finnish, Hungarian, Indonesian, Japanese, Kazakh, Korean,
           Turkish, and Uyghur. We found that sentence piece based token encoding improves parsing performance
           for the majority of the tested languages, and that using the entire meaning of the sentence as a
           complementary feature enhances parsing performance for six languages out of nine.

           Keywords
           agglutinative languages, dependency parsing, sentence piece, sentence representation




1. Introduction
Dependency parsing is one of the core components of natural language processing (NLP); it
identifies the syntactic relationships among the words within a sentence and is crucial for several
downstream NLP tasks. Zhou et al. [1] employed dependency parsing
to obtain semantic representation in order to enhance text-to-speech. Luo et al. [2] applied
dependency parsing knowledge as supplementary information, which allows the question
answering (QA) model to better match within the semantic component of the question. Zhang
et al. [3] utilized the encoder outputs of a dependency parser as inputs to a Seq2Seq neural
machine translation (NMT) model by training both dependency parsing and machine translation
model parameters concurrently. Cai and Lapata [4], Xia et al. [5] reported that syntax-aware
representation improves the semantic role labeling (SRL) performance.
    In linguistic typology, agglutinative languages are a subcategory of morphologically rich
languages that present a significant challenge for NLP research. With their rich morpho-syntax,

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural
Language Processing (ALTNLP), June 7-8, Koper, Slovenia
maltintas@itu.edu.tr (M. Altıntaş); tantug@itu.edu.tr (A. C. Tantuğ)
         © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
a word may contain many morphemes, each of which is responsible for supplying the word
with a grammatical function or endowing new meaning. A word may have numerous different
surface forms, which entails out-of-vocabulary (OOV) and data sparsity problems. To mitigate these
problems, sub-word level representations have been proposed in the literature. Dos Santos
and Zadrozny [6] and Kim et al. [7] reported that character-level word representations improve
performance on word-level tasks. Yu et al. [8] proposed to use syllable-level word embeddings
in morphologically rich languages such as Korean. Bojanowski et al. [9] introduced an extension
of the continuous skip-gram model in which words are represented as the sum of the n-gram
character vectors. However, agglutinative languages convey grammatical information through
inflections, so they tend to have a more flexible word order. This causes discontinuous
constituents that impose non-projectivity in dependency structures [10]. Fortunately, splitting
off their morphemes is simple, since each piece of grammatical information is contained in a single
morpheme and vice versa. Eryiğit and Oflazer [11] demonstrated that considering morphemes as
the primary units of syntactic structure rather than word forms improves parsing accuracy for
an agglutinative language, Turkish. Özateş et al. [12] made use of the morpheme information
and hand-crafted rules to improve the word vector representation in dependency parsing.
In this paper, we propose two enhancements to increase the dependency parsing accuracy of
agglutinative languages in particular, although they are not restricted to them:

    • We employ sub-word level, sentence piece [13] based word representation to capture
      morphemes more precisely and to attenuate the OOV and data sparsity problems. Sentence
      piece is a language-independent universal sub-word tokenizer designed for neural text
      processing (see the sketch after this list).
    • As a feature complementary to the token features, we use a sentence representation that
      holds the whole meaning of the sentence. This is motivated by the observation that
      sentences with the same meaning but different word orders have the same dependency
      tree structure.
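
For illustration, the snippet below is a minimal sketch of training and applying a SentencePiece model with its Python package [13, 15]; the corpus path, vocabulary size, model prefix, and sample word are placeholders, and the exact segmentation depends entirely on the trained model.

    import sentencepiece as spm

    # Train a unigram SentencePiece model on a raw-text corpus (paths and sizes are placeholders).
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="sp_model", vocab_size=8000, model_type="unigram"
    )

    # Load the trained model and split a surface form into sub-word pieces.
    sp = spm.SentencePieceProcessor(model_file="sp_model.model")
    pieces = sp.encode("evlerimizden", out_type=str)   # a sample Turkish word; the split is model-dependent
    print(pieces)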

We investigate the impact of our proposed improvements on dependency parsing accuracy for
nine agglutinative languages: Estonian, Finnish, Hungarian, Indonesian, Japanese, Kazakh,
Korean, Turkish, and Uyghur.


2. Approach
Our proposed model is an enhancement of the parser described by Dozat and Manning [14].
The enhanced model comprises an LSTM-based encoder and biaffine classifiers. Sub-word level
representations, both character based and sentence piece [15] based, are obtained by applying an
attention mechanism over the hidden states of a single LSTM layer. Three bi-directional LSTM
layers are utilized to make the concatenation of token and sub-token embeddings context-aware.
Pre-trained word embeddings are added to the model after these bi-LSTM layers. The sentence
representation, obtained by concatenating the last hidden states of the bi-LSTM with the sentence
vector from the pre-trained model, is also employed as an extra feature by broadcasting it to each
word in the sentence. Figure 1 illustrates our proposed neural dependency parser architecture.
Figure 1: Our neural dependency parser model architecture

   To express this in formulas, 𝑠 is a sentence that includes 𝑛 words and is represented as
𝑠 = 𝑤0 , 𝑤1 , ..., 𝑤𝑛 , where 𝑤0 is added synthetically as the ROOT token. Each word 𝑤𝑖 can be represented


by a combination of its surface form (𝑢𝑖 ), lemma (𝑙𝑖 ), POS tag (𝑡𝑖 ), morphological features (𝑚𝑖 ),
characters (𝑐𝑖 ), and sentence pieces (𝑝𝑖 ), as given below (Equ. 1).

    w_i = u_i, l_i, t_i, m_i, c_i, p_i                                     (1)
   Here, 𝑐𝑖 and 𝑝𝑖 are sub-word level features of the word while 𝑢𝑖 , 𝑙𝑖 , 𝑡𝑖 and 𝑚𝑖 are word level
features.
   Encoder: The concatenation of the word level (Equ. 3) and sub-word level (Equ. 4) embedding
vectors yields the vector 𝑥𝑖 that is used as input to the bi-LSTM layers (Equ. 2).
                                         𝑥𝑖 = 𝑡𝑜𝑘𝑒𝑛𝑖 ⊕ 𝑠𝑢𝑏𝑡𝑜𝑘𝑒𝑛𝑖                                   (2)
                                𝑡𝑜𝑘𝑒𝑛𝑖 = 𝑒(𝑢𝑖 ) ⊕ 𝑒(𝑙𝑖 ) ⊕ 𝑒(𝑡𝑖 ) ⊕ 𝑒(𝑚𝑖 )                         (3)
                                        𝑠𝑢𝑏𝑡𝑜𝑘𝑒𝑛𝑖 = 𝑓 (𝑐𝑖 ) ⊕ 𝑓 (𝑝𝑖 )                              (4)
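
A minimal PyTorch sketch of Equations 2-4, with embedding dimensions loosely based on Table 2a (which feature uses which dimension is our assumption) and hypothetical vocabulary sizes; the sub-word encoders f(c_i) and f(p_i) are sketched further below.

    import torch
    import torch.nn as nn

    # Embedding tables for the word level features of Equ. 3 (vocabulary sizes are placeholders).
    emb_form  = nn.Embedding(20000, 75)   # surface form u_i
    emb_lemma = nn.Embedding(15000, 75)   # lemma l_i
    emb_pos   = nn.Embedding(20, 50)      # POS tag t_i
    emb_morph = nn.Embedding(500, 50)     # morphological features m_i

    def token_vector(u, l, t, m):
        # token_i = e(u_i) (+) e(l_i) (+) e(t_i) (+) e(m_i)   (Equ. 3)
        return torch.cat([emb_form(u), emb_lemma(l), emb_pos(t), emb_morph(m)], dim=-1)

    def input_vector(token_vec, char_vec, piece_vec):
        # x_i = token_i (+) subtoken_i, with subtoken_i = f(c_i) (+) f(p_i)   (Equ. 2 and 4)
        return torch.cat([token_vec, char_vec, piece_vec], dim=-1)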
   A sub-word representation is obtained by applying attention to the stacked hidden states of a
single-layer LSTM (Equ. 5).

    f(p) = H_p^T a_p                                                       (5)
    a_p = \mathrm{sigmoid}(H_p w_p^{attention})                            (6)
    H_p = [h_0; h_1; ...; h_P]                                             (7)
    r_i = \mathrm{LSTM}((e(p_{i,0}), ..., e(p_{i,P})))                     (8)
    (\overleftarrow{h}_k, \overrightarrow{h}_P) = \mathrm{split}(r_i, k)   (9)
   where 𝑝𝑖 = 𝑝𝑖,1 , 𝑝𝑖,2 , ..., 𝑝𝑖,𝑃 is the sequence of sub-word features of the word and 𝑃 is the
number of sub-word features, which may be the sentence pieces or the characters of the word.
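
The following is a minimal PyTorch sketch of this attention pooling (Equ. 5-8), assuming a single bi-directional LSTM over sub-word (character or sentence piece) embeddings; vocabulary size, dimensions, and padding handling are illustrative.

    import torch
    import torch.nn as nn

    class SubwordEncoder(nn.Module):
        """Attention pooling over the hidden states of a single LSTM layer (Equ. 5-8)."""
        def __init__(self, vocab_size, emb_dim=100, hidden_dim=100):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
            self.attention = nn.Linear(2 * hidden_dim, 1, bias=False)   # w^{attention}

        def forward(self, pieces):
            # pieces: (batch, P) sub-word ids, one word per row
            H, _ = self.lstm(self.embedding(pieces))   # stacked hidden states H_p   (Equ. 7-8)
            a = torch.sigmoid(self.attention(H))       # a_p = sigmoid(H_p w^{attention})   (Equ. 6)
            return (H * a).sum(dim=1)                  # f(p) = H_p^T a_p   (Equ. 5)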
   A multi-layer Bi-LSTM (Equ. 10) is used to generate contextual word representations over the
𝑥𝑖 vectors. An external contextualized word representation, which may be obtained from ELECTRA
or BERT, is concatenated with the backward and forward hidden states of the corresponding word
from the last Bi-LSTM layer (Equ. 12).

    r = \mathrm{BiLSTM}((x_0, ..., x_n))                                                         (10)
    (\overleftarrow{h}_i, \overrightarrow{h}_i), (\overleftarrow{h}_0, \overrightarrow{h}_n) = \mathrm{split}(r, i)   (11)
    z_i = T(u_i) \oplus \overleftarrow{h}_i \oplus \overrightarrow{h}_i                          (12)
where T(u_i) denotes the pre-trained model vector of the word surface form u_i.
  To represent a sentence, the pre-trained model sentence embedding vector is concatenated
with the final hidden states of the last bi-LSTM layer’s backward and forward directions
respectively (Equ. 13).
    \mathit{sentence}_{vector} = T(\mathrm{'CLS'}) \oplus \overleftarrow{h}_0 \oplus \overrightarrow{h}_n   (13)
where 𝑇 (′ 𝐶𝐿𝑆 ′ ) provides the sentence representation.
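
A minimal PyTorch sketch of how the contextual vectors z_i and the broadcast sentence representation might be assembled (Equ. 10-13), assuming the pre-trained word vectors T(u_i) and the sentence vector T('CLS') are computed externally; dimensions, padding handling, and all names are illustrative.

    import torch
    import torch.nn as nn

    class ContextEncoder(nn.Module):
        def __init__(self, in_dim, hidden_dim=400, num_layers=3):
            super().__init__()
            self.bilstm = nn.LSTM(in_dim, hidden_dim, num_layers=num_layers,
                                  batch_first=True, bidirectional=True)

        def forward(self, x, plm_word_vecs, plm_sentence_vec):
            # x: (batch, n, in_dim); plm_word_vecs: (batch, n, d_word); plm_sentence_vec: (batch, d_sent)
            r, _ = self.bilstm(x)                          # forward/backward states per word   (Equ. 10)
            z = torch.cat([plm_word_vecs, r], dim=-1)      # z_i = T(u_i) (+) backward/forward h_i   (Equ. 12)
            half = r.size(-1) // 2
            h_forward_last = r[:, -1, :half]               # forward state at the last word
            h_backward_first = r[:, 0, half:]              # backward state at the first word
            sentence = torch.cat([plm_sentence_vec, h_backward_first, h_forward_last], dim=-1)   # (Equ. 13)
            # Broadcast the sentence representation to every word and append it as an extra feature.
            sentence = sentence.unsqueeze(1).expand(-1, z.size(1), -1)
            return torch.cat([z, sentence], dim=-1)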
   Classifier: Deep bi-affine attention, as proposed by Dozat and Manning [14], is employed as
the classifier. Multi-layer perceptrons (MLPs) are used to obtain condensed head (Equ. 15) and
dependent (Equ. 14) representations of each word. These representations are then fed into a
bi-affine attention mechanism, which produces, for each word in the sentence, a score vector
expressing the likelihood of every other word being its parent (Equ. 17).
    h_i^{(arc-dep)} = \mathrm{MLP}^{(arc-dep)}(z_i)                              (14)
    h_i^{(arc-head)} = \mathrm{MLP}^{(arc-head)}(z_i)                            (15)
    H^{(arc-head)} = [h_0^{(arc-head)}; ...; h_n^{(arc-head)}]                   (16)
    s_i^{(arc)} = \mathrm{biaffine}^{(arc)}(H^{(arc-head)}, h_i^{(arc-dep)})     (17)
   Similarly, another bi-affine classifier is employed to compute the dependency label probabili-
ties of the relevant word with each probable head (Equ. 21).
    h_i^{(rel-dep)} = \mathrm{MLP}^{(rel-dep)}(z_i)                              (18)
    h_i^{(rel-head)} = \mathrm{MLP}^{(rel-head)}(z_i)                            (19)
    H^{(rel-head)} = [h_0^{(rel-head)}; ...; h_n^{(rel-head)}]                   (20)
    s_i^{(rel)} = \mathrm{biaffine}^{(rel)}(H^{(rel-head)}, h_i^{(rel-dep)})     (21)
   These two bi-affine classifiers are jointly trained with respect to the sum of their cross-entropy
losses. At test time, the Chu-Liu-Edmonds algorithm is used to extract the maximum spanning
tree from the resulting score matrices.
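
As a concrete illustration of the classifier (Equ. 14-21), the following is a minimal PyTorch sketch of a deep biaffine scorer in the style of Dozat and Manning [14]; the encoder output dimension, the number of dependency labels, and the use of bias terms are assumptions, so details may differ from the exact implementation.

    import torch
    import torch.nn as nn

    class Biaffine(nn.Module):
        """Biaffine attention: score(dep, head) = dep^T U head, with affine (bias) terms."""
        def __init__(self, dim, out_channels=1):
            super().__init__()
            # The +1 on each side adds the affine terms, as in Dozat and Manning [14].
            self.weight = nn.Parameter(torch.zeros(out_channels, dim + 1, dim + 1))
            nn.init.xavier_uniform_(self.weight)

        def forward(self, head, dep):
            # head, dep: (batch, n, dim) -> scores: (batch, out_channels, n_dep, n_head)
            ones = head.new_ones(*head.shape[:-1], 1)
            head = torch.cat([head, ones], dim=-1)
            dep = torch.cat([dep, ones], dim=-1)
            return torch.einsum("bid,odt,bjt->boij", dep, self.weight, head)

    # MLPs condense the encoder output z_i (Equ. 14-15 and 18-19); 1024 is a placeholder input
    # dimension, 512 and 128 are the arc and label vector dimensions of Table 2a, and 40 is a
    # placeholder label count.
    arc_mlp_head = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())
    arc_mlp_dep  = nn.Sequential(nn.Linear(1024, 512), nn.ReLU())
    rel_mlp_head = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())
    rel_mlp_dep  = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())
    arc_scorer = Biaffine(512, out_channels=1)    # one score per (dependent, candidate head) pair
    rel_scorer = Biaffine(128, out_channels=40)   # one score per dependency label

    def score(z):
        # z: (batch, n, 1024) contextual word vectors; returns arc and label scores (Equ. 17 and 21).
        s_arc = arc_scorer(arc_mlp_head(z), arc_mlp_dep(z)).squeeze(1)   # (batch, n_dep, n_head)
        s_rel = rel_scorer(rel_mlp_head(z), rel_mlp_dep(z))              # (batch, labels, n_dep, n_head)
        return s_arc, s_rel

At test time the arc score matrix would be decoded with Chu-Liu-Edmonds as described above, and the labels would then be read off the label scores of the selected arcs.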


3. Experiment and Results
Our proposed enhancements on dependency parsing have been evaluated on nine agglutina-
tive languages, namely, Estonian, Finnish, Hungarian, Indonesian, Japanese, Kazakh, Korean,
Turkish, and Uyghur. Table 1 lists some prominent properties of the utilized treebanks. Indonesian
belongs to the Austronesian language family. The official splits of the treebanks have been used. All
of the scores in this research are obtained on the test set of the related treebank by the model trained
on its training set. The Uyghur UDT treebank has no validation set; thus, we have used its test set
to check that the training process was not over-fitting.

Table 1
Some prominent properties of the treebanks used in our experiments.

  Treebank            Train sent.   Test sent.   Valid. sent.   Unique tokens   Avg. tokens/sent.   Language family
  Estonian EDT              24633         3214           3125           80195               14.13   Uralic
  Finnish TDT                8054         1555           1364           53881               13.35   Uralic
  Hungarian Szeged            910          449            441           13469               23.35   Uralic
  Indonesian GSD             4477          557            559           20179               21.55   Austronesian
  Japanese GSD               7050          543            507           21680               23.90   Japonic
  Kazakh KTB                   31         1047              0            4387                9.77   Turkic
  Korean GSD                 4400          989            950           35846               12.67   Korean
  Turkish IMST               3664          983            988           17105               10.26   Turkic
  Uyghur UDT                 1656          900              -           12068               11.64   Turkic


  The hyper-parameters are listed in Table 2a. The same hyper-parameters were employed for
all languages. The pre-trained word embeddings have been used in the following order of
preference: ELECTRA [16], BERT [17], ELMo, and word2vec. In other words, if there is no trained
ELECTRA language model (LM) for the relevant language, the BERT LM has been used; if no
BERT LM exists, ELMo has been employed to obtain the pre-trained word vectors; and if there is
no ELMo, word2vec pre-trained vectors have been exploited. Table 2b shows which pre-trained
word vectors have been used for each language. For words that consist of multiple word pieces
in BERT and ELECTRA, we only utilized the vector of the first word piece and disregarded the
remainder. We have used the Xavier uniform initialization [18] with the same random seed for
all our experiments.
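
For illustration, the snippet below shows one way to keep only the first word-piece vector per word using a HuggingFace fast tokenizer; it is a minimal sketch of this alignment step, and the checkpoint name is a placeholder.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")   # placeholder checkpoint
    model = AutoModel.from_pretrained("bert-base-multilingual-cased")

    def first_piece_vectors(words):
        """Return one contextual vector per word, taken from its first word piece."""
        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state[0]           # (num_pieces, hidden_size)
        vectors, seen = [], set()
        for idx, word_id in enumerate(enc.word_ids()):
            if word_id is not None and word_id not in seen:      # keep only the first piece of each word
                seen.add(word_id)
                vectors.append(hidden[idx])
        return torch.stack(vectors)                              # (num_words, hidden_size)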

Table 2
(a) Hyper-parameters and (b) how pre-trained word vectors are obtained per language

  (a) Hyper-parameters
  Num. of word-level Bi-LSTM layers    3
  Word embedding dim.                  75
  Tag embedding dim.                   50
  Sub-token embedding dim.             100
  Arc vector dim.                      512
  Label vector dim.                    128
  Dropout rate                         0.5
  Optimizer                            AdamW
  β1                                   0.900
  β2                                   0.999
  Learning rate                        5e-5

  (b) Pre-trained word vectors per language
  Language      Pre-trained vec.   Ref.
  Estonian      BERT
  Finnish       BERT               [19]
  Hungarian     BERT               [20]
  Indonesian    BERT
  Japanese      BERT
  Kazakh        word2vec           [21]
  Korean        ELECTRA            [22]
  Turkish       ELECTRA
  Uyghur        ELMo               [23]

   To obtain sub-token based word representations, the first ten sentence pieces and the first
twenty characters of each word have been used and the rest has been ignored. During training,
one more layer of the pre-trained model is fine-tuned in each iteration, starting from its final
layer. The AdamW optimizer [24] is employed with a linear warm-up schedule.
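
A minimal sketch of this optimization setup, assuming the HuggingFace get_linear_schedule_with_warmup helper, a BERT/ELECTRA-style encoder layout, and placeholder warm-up and step counts; the unfreezing helper mirrors the layer-by-layer fine-tuning described above.

    import torch
    from transformers import get_linear_schedule_with_warmup

    def build_optimizer(model, total_steps, warmup_steps=1000):
        """AdamW with the hyper-parameters of Table 2a and a linear warm-up schedule."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, betas=(0.900, 0.999))
        scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
        return optimizer, scheduler

    def unfreeze_top_layers(plm, num_unfrozen):
        """Keep the pre-trained model frozen except for its top `num_unfrozen` layers,
        so that one more layer can be released at each training iteration."""
        layers = list(plm.encoder.layer)          # assumes a HuggingFace BERT/ELECTRA-style module layout
        for layer in layers:
            layer.requires_grad_(False)
        for layer in layers[len(layers) - num_unfrozen:]:
            layer.requires_grad_(True)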
   As evaluation metrics, the word-based unlabeled attachment score (UAS) and labeled attach-
ment score (LAS) are utilized. The CoNLL 2018 UD Shared Task evaluation script1 has been used to
calculate UAS and LAS.
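
For reference, a typical invocation of the evaluation script might look like the following; the file names are placeholders, and the -v (verbose) flag, which prints the full metric table including UAS and LAS, is our assumption about the script's interface.

    import subprocess

    # Compare a system output CoNLL-U file against the gold test file (file names are placeholders).
    subprocess.run(
        ["python", "conll18_ud_eval.py", "-v", "test-gold.conllu", "test-system.conllu"],
        check=True,
    )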
   To demonstrate the impact of our proposed enhancements, namely sentence piece based word
representation and the complementary sentence representation, we report the UAS and LAS of
our three models. Our benchmark model uses sentence piece based word representation but not
the sentence representation. The other two models are the model without sentence piece based
word representation and the model with the sentence representation. Table 4 shows the
performances of our models and of some previous models that are trained with gold annotations
on the same treebanks: UDify [25], UDPipe 2.0 [26], and UDPipe 2.0 with BERT and Flair
pre-trained word embeddings [26].
   The UDify [25] model aims to create a single parsing model for the 75 languages in the UD
dataset, leveraging the multilingual BERT model, which has been trained on the 104 largest
Wikipedia languages. This parser demonstrates that languages with minimal labeled data can be
parsed by using data from other languages. Its encoder output is obtained using an attention
mechanism over the layers of the pre-trained model.

   1
       The evaluation script can be downloaded from http://universaldependencies.org/conll18/conll18_ud_eval.py
Table 4
The UAS and LAS of our models and of previous models on the test set of the corresponding treebank

                       Metric   UDify [25]   UDPipe 2.0 [26]   UDPipe 2.0 with    Our Model w/o    Our Model   Our Model with
                                                               BERT+Flair [26]    Sentence Piece               Sent. Repr.
  Estonian EDT         UAS      89.53        88.00             89.46              90.97            90.72       90.81
                       LAS      86.67        85.18             86.77              88.38            88.18       88.31
  Finnish TDT          UAS      86.42        89.88             91.66              94.10            94.39       94.29
                       LAS      82.03        87.46             89.49              92.65            92.85       92.70
  Hungarian Szeged     UAS      89.68        84.04             88.76              90.44            90.52       90.76
                       LAS      84.88        79.73             85.12              86.77            87.00       87.34
  Indonesian GSD       UAS      86.45        85.31             86.47              85.41            85.50       85.19
                       LAS      80.10        78.99             80.40              78.11            78.23       78.39
  Japanese GSD         UAS      94.37        95.06             95.55              94.24            94.28       94.54
                       LAS      92.08        93.73             94.24              93.08            93.25       93.57
  Kazakh KTB           UAS      74.77        53.30             57.02              64.85            63.35       62.95
                       LAS      63.66        33.38             38.72              46.30            44.70       44.34
  Korean GSD           UAS      82.74        87.70             89.38              92.00            92.06       92.24
                       LAS      74.26        84.24             86.05              89.36            89.42       89.48
  Turkish IMST         UAS      74.56        74.19             76.30              81.44            82.18       82.70
                       LAS      67.44        67.56             70.11              75.72            76.19       76.51
  Uyghur UDT           UAS      65.89        78.46             79.10              76.30            78.68       76.45
                       LAS      48.80        67.09             67.46              64.00            67.44       64.44
   UDPipe 2.0 [27] is an NLP tool that also includes a dependency parser. Except for a few
minor differences, its architecture is nearly identical to that of our base parser. It uses
character-based word representations, obtained by a bi-directional GRU (gated recurrent unit),
as its only sub-word level representation. It employs three forms of embeddings to represent
each input word: a pre-trained word embedding, a trained word embedding, and a character-based
word embedding. Straka et al. [26] investigated the impact of utilizing both BERT and Flair word
vectors in UDPipe 2.0.
   The results show that the sentence piece based word representation has contributed to parsing
performance in all experimented languages other than Estonian and Kazakh. The sentence
representation has improved parsing performance for Estonian, Hungarian, Japanese, Korean, and
Turkish. In Indonesian, the sentence representation has boosted the LAS while slightly decreasing
the UAS. In Finnish, Kazakh, and Uyghur, the sentence representation has had a slightly
unfavorable effect on the UAS and the LAS. We have achieved higher scores than previously
reported in [25, 26] for Estonian, Finnish, Hungarian, Korean, and Turkish.


4. Discussion and Conclusion
In this study, we propose to employ sub-word level sentence piece based word representation
and sentence representation that stores the entire meaning of the sentence in order to boost
dependency parsing performance. Although the proposed improvements are applicable to
all languages, we investigate their influence on a subset of languages, the nine agglutinative
languages. We intend to alleviate the challenges that dependency parsing faces for agglutinative
languages due to their characteristic properties, such as rich morpho-syntax and flexible word
order.
   With the exception of Estonian and Kazakh, sentence piece based token encoding improves
parsing performance in all tested languages by better capturing morphemes. Despite being an
agglutinative language, Estonian borrows about a third of its vocabulary from Germanic languages.
We think that this is why sentence piece based word encoding does not increase parsing accuracy
in this language. The obtained result for Kazakh is attributed to a data shortage, because the
Kazakh training set has just 31 sentences. Due to a lack of learning data, parsing accuracy
diminishes as the number of learned parameters grows with each additional feature. In Estonian,
Hungarian, Japanese, Korean, Turkish, and partially Indonesian, employing the sentence
representation as an additional feature improves parsing accuracy, because the entire meaning of
the sentence contributes to extracting syntactic information. We construct our sentence
representation by concatenating the last hidden states of the bi-LSTM backward and forward
directions, as well as ELECTRA or BERT based sentence vectors where they are available. However,
because there are no publicly accessible ELECTRA or BERT pre-trained LMs for the Kazakh and
Uyghur languages, the sentence representations of these two languages rely only on the final
hidden states of the backward and forward directions of the bi-LSTM. Additionally, the training
data of these languages are too small to provide a well-learned sentence representation. As a result,
using the sentence representation in these languages is ineffective in improving parsing accuracy.
For Finnish, we obtained an unexpected result. Finnish has a large vocabulary because it is a
highly morphologically rich language. Because of this vast vocabulary, the pre-trained LM
tokenizers for this language mostly split tokens into word pieces that represent morphemes rather
than words. We only used the vector of the first word piece per word when fine-tuning the BERT
or ELECTRA LM, ignoring the remainder. We suspect that the sentence vector loses syntactic
information because some word pieces that carry syntactic information are disregarded. This
might be why the sentence representation is unable to increase parsing performance for Finnish.
   In conclusion, sub-word units and morpho-syntactic features are critical to identifying
the syntactic function of the word for agglutinative languages. Sentence piece based word
representation contributes to capturing morphemes of the word and enhances parsing accuracy.
Furthermore, with a few exceptions, sentence representation that stores the whole meaning of
the sentence increases parsing performance for the majority of languages.


Acknowledgments
We would like to thank Wiseborn M. Danquah and my dear wife Şeyma Altıntaş for their
insightful remarks, as well as all of the other anonymous reviewers who took the time and
effort to review this research.


References
 [1] Y. Zhou, C. Song, J. Li, Z. Wu, H. Meng, Dependency parsing based semantic representation
     learning with graph neural network for enhancing expressiveness of text-to-speech, arXiv
     preprint arXiv:2104.06835 (2021).
 [2] K. Luo, F. Lin, X. Luo, K. Zhu, Knowledge base question answering via encoding of complex
     query graphs, in: Proceedings of the 2018 Conference on Empirical Methods in Natural
     Language Processing, 2018, pp. 2185–2194.
 [3] M. Zhang, Z. Li, G. Fu, M. Zhang, Syntax-enhanced neural machine translation with
     syntax-aware word representations, arXiv preprint arXiv:1905.02878 (2019).
 [4] R. Cai, M. Lapata, Syntax-aware semantic role labeling without parsing, Transactions of
     the Association for Computational Linguistics 7 (2019) 343–356.
 [5] Q. Xia, Z. Li, M. Zhang, M. Zhang, G. Fu, R. Wang, L. Si, Syntax-aware neural semantic role
     labeling, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 33,
     2019, pp. 7305–7313.
 [6] C. Dos Santos, B. Zadrozny, Learning character-level representations for part-of-speech
     tagging, in: International Conference on Machine Learning, PMLR, 2014, pp. 1818–1826.
 [7] Y. Kim, Y. Jernite, D. Sontag, A. M. Rush, Character-aware neural language models, in:
     Thirtieth AAAI conference on artificial intelligence, 2016, pp. 2741–2749.
 [8] S. Yu, N. Kulkarni, H. Lee, J. Kim, Syllable-level neural language model for agglutinative
     language, arXiv preprint arXiv:1708.05515 (2017).
 [9] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword
     information, Transactions of the association for computational linguistics 5 (2017) 135–146.
[10] R. Tsarfaty, D. Seddah, Y. Goldberg, S. Kübler, Y. Versley, M. Candito, J. Foster, I. Rehbein,
     L. Tounsi, Statistical parsing of morphologically rich languages (spmrl) what, how and
     whither, in: Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of
     Morphologically-Rich Languages, 2010, pp. 1–12.
[11] G. Eryiğit, K. Oflazer, Statistical dependency parsing of Turkish, Sabanci University
     Research Database (2006).
[12] Ş. B. Özateş, A. Özgür, T. Güngör, B. Öztürk, A hybrid approach to dependency parsing:
     Combining rules and morphology with deep learning, arXiv preprint arXiv:2002.10116
     (2020).
[13] T. Kudo, Subword regularization: Improving neural network translation models with
     multiple subword candidates, arXiv preprint arXiv:1804.10959 (2018).
[14] T. Dozat, C. D. Manning, Deep biaffine attention for neural dependency parsing, arXiv
     preprint arXiv:1611.01734 (2016).
[15] T. Kudo, J. Richardson, Sentencepiece: A simple and language independent subword
     tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226
     (2018).
[16] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as
     discriminators rather than generators, arXiv preprint arXiv:2003.10555 (2020).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[18] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural net-
     works, in: Proceedings of the thirteenth international conference on artificial intelligence
     and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 249–256.
[19] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, S. Pyysalo,
     Multilingual is not enough: Bert for finnish, arXiv preprint arXiv:1912.07076 (2019).
[20] D. M. Nemeskey, Natural Language Processing Methods for Language Modeling, Ph.D.
     thesis, Eötvös Loránd University, 2020.
[21] F. Ginter, J. Hajič, J. Luotolahti, M. Straka, D. Zeman, CoNLL 2017 shared task - auto-
     matically annotated raw texts and word embeddings, 2017. URL: http://hdl.handle.net/
     11234/1-1989, LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied
     Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[22] K. Kim, Pretrained language models for korean, https://github.com/kiyoungkim1/LMkor,
     2020.
[23] W. Che, Y. Liu, Y. Wang, B. Zheng, T. Liu, Towards better UD parsing: Deep contextu-
     alized word embeddings, ensemble, and treebank concatenation, in: Proceedings of the
     CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependen-
     cies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 55–64. URL:
     http://www.aclweb.org/anthology/K18-2005.
[24] I. Loshchilov, F. Hutter, Fixing weight decay regularization in adam, 2018. URL: https:
     //openreview.net/forum?id=rk6qdGgCZ.
[25] D. Kondratyuk, M. Straka, 75 languages, 1 model: Parsing universal dependencies univer-
     sally, arXiv preprint arXiv:1904.02099 (2019).
[26] M. Straka, J. Straková, J. Hajič, Evaluating contextualized embeddings on 54 languages
     in POS tagging, lemmatization and dependency parsing, arXiv preprint arXiv:1908.07448
     (2019).
[27] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: Proceedings of the CoNLL
     2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018,
     pp. 197–207.