TransMorpher: A Phonologically Informed
Transformer-based Morphological Analyzer
Karahan Şahin1 , Ümit Atlamaz2
1 Cognitive Science Program, Boğaziçi University, İstanbul, Turkey
2 Department of Linguistics, Boğaziçi University, İstanbul, Turkey


Abstract

We introduce TransMorpher, a phonologically informed Transformer-based morphological analyzer that returns disambiguated morphological parses. TransMorpher consists of a phonological normalization module (inspired by the modular architecture widely adopted in the field of Generative Linguistics), a character-based encoder with multi-head self-attention for encoding words, a pre-trained language model for encoding context, and a decoder with multi-head self-attention. The language-specific phonological normalization module maps phonological variants of morphemes onto unique abstract representations. Normalized words are fed to the word encoder, whose output representations are concatenated with the contextual representation of each word obtained from the pre-trained model. The concatenated representations are then fed to the decoder, which auto-regressively generates a single disambiguated morphological parse for each word. We evaluate TransMorpher on Turkish, an agglutinative language with rich morpho-phonological variation, in a relatively low-resource setting and obtain promising results with 85% accuracy. Our experiments show that phonological normalization contributes a 5% gain in tag accuracy and a 10% gain in lemma accuracy. We also tested our model without the phonological normalization module on Danish, Swedish, and Russian in low-resource contexts and achieved acceptable accuracy rates.

Keywords
morphological analysis, transformer, agglutinative morphology




1. Introduction
We introduce TransMorpher, a phonologically informed transformer-based morphological
analyzer that takes in sentences as input and produces disambiguated lemmas and morphological
features of each word as the output. TransMorpher is a two-level analyzer that consists of a
rule-based phonological normalization module and a sequence-to-sequence character translation
module using the original Transformer architecture [1].
   TransMorpher was explicitly developed for Turkish, an agglutinative language with 132
distinct derivational and inflectional morphemes and rich allomorphy due to vowel harmony
and other productive phonological processes. For example, the past tense suffix can be realized
as one of [dı, du, di, dü, tı, tu, ti, tü] depending on the final vowel and consonant of the stem
it attaches to. One immediate corollary of this phonological alternation is that it explodes the
morpheme space, producing multiple surface forms for the same abstract morpheme. This
complicates the morphological analysis task by increasing the variation in the dataset and by
skewing the distribution of surface forms of abstract morphemes, which creates a sparsity
problem for some of the morphemes.
   To alleviate this problem, we first pass the surface forms through a phonological module
which normalizes the allophonic variants into a single abstract representation. This is inspired
by the inverted Y model adopted in Generative Linguistics, and in Distributed Morphology
in particular, where morphology is considered to be the module that maps the output of the
syntactic component onto the phonological component of the grammar [2]. Our experiments
show that phonological normalization increases accuracy significantly in low-resource settings.
While the phonological normalization module is specific to Turkish, the sequence-to-sequence
translation module is language-agnostic and can be used independently for other languages.
   The paper is organized as follows: Section 2 discusses related work, Section 3 details the
phonological normalization module and the sequence translation model, Section 4 introduces
the datasets we use, Section 5 presents the experiments and results, and Section 6 concludes.


2. Related Work
Morphological analysis of agglutinative languages has been dominated by Finite State Transducers
(FSTs). The two-level morphology developed by Koskenniemi [3] models both phonological
and morphological grammars, enabling morphological analysis via finite state machines.
Early implementations of finite state technology include the PC-KIMMO system developed
by Antworth [4] based on Koskenniemi [3] and XFST, the Xerox Finite State Tool developed
by Beesley and Karttunen [5].
   Finite State Transducers have been widely used in the morphological analysis of Turkish as
well. Oflazer [6] built a two-level morphological analyzer for Turkish, first based on the
PC-KIMMO system and later with the Xerox tools. Çöltekin [7] implemented the first freely
available two-level FST analyzer for Turkish.
   One of the major challenges in morphological analysis is ambiguity. FSTs end up generating
multiple candidate analyses, which need further disambiguation, usually with various machine
learning techniques. Sak et al. [8] developed an analyzer with a perceptron-based disambiguation
component, and Sak et al. [9] implemented a stochastic FST to address the disambiguation
problem in Turkish.
   Recently, a purely deep learning-based line of morphological analyzers has started to emerge.
Malaviya et al. [10] used a neural factor graph model for cross-lingual morphological analysis.
Akyürek et al. [11] introduced Morse, a recurrent encoder-decoder model that performs
cross-lingual morphological analysis and disambiguation jointly. Morse uses an LSTM for
left-to-right character encoding of each word. It then uses a bi-directional LSTM to encode the
context of each word, yielding a unique contextual embedding that allows the model to
disambiguate. The decoder is a unidirectional LSTM as well. The main advantage of using
deep learning-based approaches is their ability to be used across many languages. Unlike FSTs,
they are not rule-based and can be quickly trained with sufficient data, alleviating the need for
tedious rule-writing by experts.
   TransMorpher takes Morse as a starting point and replaces the LSTM components with a
Transformer-based encoder for token encoding and a BERT-based model [12] for context
encoding. In addition, it has a phonological normalization module that normalizes each
token before it is fed to the encoder. Like Morse, TransMorpher can do both morphological
analysis and disambiguation simultaneously. We achieve 85% accuracy in a low-resource setting
and show that the phonological normalization module significantly improves accuracy in
that setting. Although the accuracy of TransMorpher does not reach the current
state-of-the-art (98.59% as reported by Akyürek et al. [11]), it achieves promising results with
10 times less data.


3. TransMorpher Architecture
TransMorpher is a character-based sequence-to-sequence translation system inspired by Sutskever
et al. [13] and Akyürek et al. [11]. It has a Transformer-based encoder-decoder module, a
rule-based phonological normalization module for Turkish, and a BERT-based [12] contextual
token encoder for encoding the context of a word in a given sentence. TransMorpher takes
as input a word and the sentence containing it and returns a disambiguated lemma
and morphological analysis. Figure 1 depicts the TransMorpher architecture. In the following
subsections, we explain the details of each component.




Figure 1: TransMorpher Architecture



3.1. Phonological Normalization Module
One of the major hypotheses maintained in the field of Generative Linguistics has been the
modularity of the components of natural language. Ever since Chomsky and Lasnik [14], the
field of generative linguistics has assumed the inverted Y model, which embeds natural language
between the Conceptual-Intentional system and the Articulatory-Phonetic system [15, 16]. In
this modular view, the morphological component is considered to be embedded between the
syntactic component and the phonological component. This is clearly articulated in
modern generative theories of morphology such as the Distributed Morphology of Halle and
Marantz [2].
   Our phonological normalization module is inspired by the modular division of labor standardly
assumed in the field of Generative Linguistics. Turkish is an agglutinative language with rich
morphology and phonology. The richness of phonological variation significantly increases the
number of surface forms to be analyzed. In addition, phonological variants introduce an
imbalance in the distribution of surface forms and create a sparsity problem for certain surface
forms. We alleviate these challenges with a rule-based shallow phonological normalization
component that converts phonological variants of certain strings into a single abstract form.
For example, the two surface variants of the plural morpheme [ler] and [lar] are turned into
the single abstract form [lAr]. Similarly, the various surface forms of the past tense morpheme
[dı, du, di, dü, tı, tu, ti, tü] are transduced into [DI]. An example of an input-output pair for the
phonological normalization module is given in (1).

(1)    Input: başlamışlardı
       Output: başlamışlArDI
       start.Nar.PastCop.V3pl
       ‘They had started.’

The phonological normalization module consists of a set of phonological rules that skip the
lemmas and apply to all the morphemes after the lemma. In its current implementation, it
mainly focuses on vowel harmony and some of the common phonological alternations (e.g.,
voicing, insertion, elision). It is not built as a comprehensive phonological module as some of
the phonological rules require morphological information and the task becomes counter-cyclic.
The main goal is to alleviate the sparsity problem in low-resource settings rather than providing
a precise phonological component.
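To make the rule component concrete, the following is a minimal Python sketch of such a normalization step. The two rules and the way the lemma boundary is supplied are illustrative assumptions on our part; the actual module uses its own hand-crafted rule inventory and lemma-skipping mechanism.

```python
import re

# Two illustrative rewrite rules; the actual module contains a larger
# hand-crafted inventory covering vowel harmony, voicing, insertion, and elision.
NORMALIZATION_RULES = [
    (re.compile(r"l[ae]r"), "lAr"),      # plural: ler/lar -> lAr
    (re.compile(r"[dt][ıiuü]"), "DI"),   # past: dı/di/du/dü/tı/tu/ti/tü -> DI
]

def normalize(word: str, lemma_length: int) -> str:
    """Normalize the suffix region of a word, skipping the lemma. How the
    lemma boundary is identified is left abstract here (passed in for
    illustration)."""
    lemma, suffixes = word[:lemma_length], word[lemma_length:]
    for pattern, abstract in NORMALIZATION_RULES:
        suffixes = pattern.sub(abstract, suffixes)
    return lemma + suffixes

# Reproduces example (1): başlamışlardı -> başlamışlArDI (lemma: "başla")
print(normalize("başlamışlardı", 5))
```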

3.2. Word Encoder
Word encoding is done via a character-level Transformer-based encoder. Each character $w_{ij}$ is
mapped to an $n$-dimensional vector $v_{ij} \in \mathbb{R}^n$. Input tokens are appended with [start] and [end]
tokens in addition to padding for tokens below the maximum sequence length. We used the
original Transformer encoder architecture introduced by Vaswani et al. [1]. The encoder is
composed of a stack of N = 3 identical layers. Each layer consists of a multi-head attention
sublayer with 8 attention heads and a position-wise fully connected feed-forward network
sublayer. Each sublayer is followed by a residual connection and layer normalization. The left part in
Figure 1 depicts the word encoder.
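A minimal PyTorch sketch of such a word encoder, using the hyperparameter values reported in Table 4, is given below. The learned positional embedding is our simplification; the original Transformer uses sinusoidal encodings.

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Character-level Transformer encoder over a single (normalized) word.
    Hyperparameters follow Table 4; positional encodings are learned here
    for brevity."""
    def __init__(self, vocab_size=322, d_model=512, n_heads=8,
                 d_ff=2048, n_layers=3, max_len=20, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.char_emb = nn.Embedding(vocab_size, d_model, padding_idx=pad_id)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, char_ids):                      # (batch, seq_len)
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_emb(char_ids) + self.pos_emb(positions)
        pad_mask = char_ids.eq(self.pad_id)           # True at padded positions
        return self.encoder(x, src_key_padding_mask=pad_mask)
```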

3.3. Context Encoder
The context encoder is a BERT-based [12] pre-trained language model that takes a sentence as
input and returns a contextual embedding for the target word. We specifically use BERTürk
by Schweter [17]. For a word $w_i$, we define its corresponding context embedding $c_i \in \mathbb{R}^h$ as the
output embedding produced by BERT. The output of the context encoder is concatenated with the
output of the word encoder and passed through a linear transformation before it is fed into the
decoder.
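A sketch of how the context embedding could be obtained with the HuggingFace transformers library is shown below. The checkpoint identifier and the first-subword pooling strategy are our assumptions; the paper does not specify either.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name for BERTurk [17]; any Turkish BERT would slot in here.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")
bert = AutoModel.from_pretrained("dbmdz/bert-base-turkish-cased")

def context_embedding(sentence: str, target_word: str) -> torch.Tensor:
    """Return the contextual embedding c_i of the target word's first subword
    (the pooling choice is our simplification)."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state[0]        # (n_subwords, hidden)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    first_piece = tokenizer.tokenize(target_word)[0]     # naive lookup, sketch only
    return hidden[tokens.index(first_piece)]

# Concatenation with the word encoder output, then a linear transformation
# before decoding; the 512 word-encoder dimension follows Table 4.
fuse = nn.Linear(512 + bert.config.hidden_size, 512)
```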

3.4. Decoder
The decoder is also composed of a stack of N = 3 identical layers, with an additional sub-layer in
each layer that performs multi-head attention over the embeddings created by the concatenation
of the word and context encoders. The output of the decoder stack is passed through a linear
layer followed by a softmax layer.
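A matching decoder sketch under the same assumptions is given below; the `memory` argument stands for the fused word-and-context representation produced by the concatenation and linear projection described above.

```python
import torch
import torch.nn as nn

class ParseDecoder(nn.Module):
    """Transformer decoder that attends over the fused word+context memory and
    emits the parse auto-regressively. Layer/head/width settings mirror the
    encoder; output positional encodings are omitted for brevity."""
    def __init__(self, out_vocab=322, d_model=512, n_heads=8,
                 d_ff=2048, n_layers=3):
        super().__init__()
        self.emb = nn.Embedding(out_vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, d_ff,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, out_vocab)      # linear layer before softmax

    def forward(self, prev_tokens, memory):
        # Causal mask: each output position attends only to earlier positions.
        T = prev_tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf"),
                                     device=prev_tokens.device), diagonal=1)
        x = self.decoder(self.emb(prev_tokens), memory, tgt_mask=mask)
        return self.out(x)                            # logits; softmax applied last
```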


4. Datasets
We evaluated TransMorpher on Turkish in a relatively low-resource setting, as well as on
Danish, Swedish, and Russian to test its multilingual capabilities.

4.1. Turkish Dataset
Several Turkish datasets are available for training morphological analyzers: Hakkani-Tür
et al. [18], Sak et al. [8], Sulubacak et al. [19], Akyürek et al. [11], and Kayadelen et al. [20],
to name a few. We evaluate our model on TWT by Kayadelen et al. [20], as it is the only
gold-standard dataset of sufficient size.
  TWT is a gold standard dependency treebank for Turkish developed by Kayadelen et al.
[20]. TWT annotations were done manually by trained linguists with reported inter-annotator
agreement scores above 90%, indicating a high degree of consistency. TWT consists of 4,851
sentences scraped from Wikipedia and various websites and annotated with the Universal
Dependencies tags in the CoNLL-U format. Details of the data distribution are given in Table 1.

                                     Source    Sentences    Tokens
                                  Wikipedia        2,310    39,932
                                       Web         2,541    26,508
                                      Total        4,851    66,440
Table 1
Treebank Statistics of TWT [20]



  Following the convention introduced by Hakkani-Tür et al. [18], we convert the dependency
annotations for each token into morphological parses by concatenating the lemma, part-of-
speech tag, and morphological features as in (2).

(2)    $\mathit{lemma} + \mathit{POS\text{-}Tag} + \mathit{Feature}_1 + \ldots + \mathit{Feature}_n$

After transforming the data into morphological parses, we randomly split the data into training,
validation, and test sets using a ratio of 70:15:15. Table 2 provides the details of the split.
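A sketch of this conversion and split is given below. The handling of multiword token rows and the choice of bare feature values (rather than Name=Value pairs) for the parse string in (2) are our assumptions; the file name is hypothetical.

```python
import random

def conllu_to_parses(path):
    """Read CoNLL-U rows (ID FORM LEMMA UPOS XPOS FEATS ...) and build
    `lemma+POS+Feature_1+...+Feature_n` strings as in (2)."""
    parses = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue                              # skip blanks and comments
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                              # skip multiword/empty tokens
            lemma, upos, feats = cols[2], cols[3], cols[5]
            values = ([] if feats == "_"
                      else [f.split("=")[1] for f in feats.split("|")])
            parses.append((cols[1], "+".join([lemma, upos] + values)))
    return parses

# Random 70:15:15 token-level split, as described above.
data = conllu_to_parses("twt.conllu")    # hypothetical file name
random.shuffle(data)
n = len(data)
train = data[:int(0.70 * n)]
val = data[int(0.70 * n):int(0.85 * n)]
test = data[int(0.85 * n):]
```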
                                              Data     Count
                                          Training     46512
                                         Validation     9966
                                               Test     9966
Table 2
Token Counts for Training, Validation, and Test Sets



4.2. Multilingual Datasets
We evaluated TransMorpher without the phonological normalization component on Danish
(https://github.com/UniversalDependencies/UD_Danish-DDT), Russian
(https://github.com/UniversalDependencies/UD_Russian-GSD), and Swedish
(https://github.com/UniversalDependencies/UD_Swedish-Talbanken), using the annotations
from the Universal Dependencies repository. As the context encoder, we used Multilingual
BERT [12]. Table 3 summarizes the token counts in each dataset. We used an 80:10:10 ratio
for the training, validation, and test sets.

                                      Language     Token Count
                                        Danish          100,733
                                       Swedish           96,820
                                        Russian          98,000
Table 3
Token Counts for Danish, Swedish, & Russian




5. Experiments
In this section, we present our training procedure and experimental results. We first discuss
the results of the Turkish experiments with and without phonological normalization. Then, we
present our results for Danish, Swedish, and Russian.

5.1. Training
All the character embeddings have $n = 512$ dimensions, and all the hidden units have $H = 2048$
dimensions. The context embeddings from the BERTürk model have $h = 512$ dimensions. We
used Xavier initialization for the model parameters [21].
  We trained the models using back-propagation with mini-batch gradient descent and a batch
size of 128. We used Cross-Entropy Loss as our loss function. We used the Adam optimizer
with a learning rate of $lr = 0.0001$, betas of 0.9 and 0.98, and an epsilon value of $10^{-9}$.
Table 4 provides the details of our parameters. Vocabulary Size and Sequence Length were
determined by the Turkish data, and the table only reflects the values for TWT. These values
were adjusted for each language.

                                              Parameter     Value
                                        Vocabulary Size       322
                                       Sequence Length         20
                                  Embedding Dimensions        512
                                     Latent Dimensions       2048
                                        Attention Heads         8
                                              Batch Size      128
                                                  Epoch         9
Table 4
Encoder Parameters
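A sketch of this training configuration is shown below. Here `model` and `train_loader` are hypothetical stand-ins for the assembled encoder-decoder and the batched TWT iterator, and `PAD_ID` is an assumed padding index.

```python
import torch
import torch.nn as nn

PAD_ID = 0   # assumed padding index; `model`/`train_loader` are stand-ins

for p in model.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)        # Xavier initialization [21]

criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.9, 0.98), eps=1e-9)

for epoch in range(9):                    # Table 4: 9 epochs, batch size 128
    for chars, sentence, target in train_loader:
        logits = model(chars, sentence, target[:, :-1])   # teacher forcing
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         target[:, 1:].reshape(-1))       # shifted gold parse
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```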



5.2. Turkish Results
Table 5 presents the lemma, tag, and POS tag accuracies for the TWT dataset, as well as precision
and recall scores for tags. Tag accuracy is calculated via exact match over the whole tag set,
excluding the lemma. POS tag and lemma accuracies are also based on exact matches. Precision
and recall are calculated on a tag-by-tag basis, assigning partial credit to partially correct
tag sets. The baseline numbers report the metrics for the TransMorpher model without the
phonological normalization module or the context encoder used for disambiguation.
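One plausible reading of these metrics in code is sketched below; the exact tag-splitting and partial-credit scheme are our assumptions about the evaluation described above.

```python
def exact_match_accuracy(golds, preds):
    """Whole-tag exact match, excluding the lemma (first '+'-joined field)."""
    hits = sum(g.split("+")[1:] == p.split("+")[1:]
               for g, p in zip(golds, preds))
    return hits / len(golds)

def tag_precision_recall(golds, preds):
    """Tag-by-tag precision/recall with partial credit for partially
    correct tag sets."""
    tp = fp = fn = 0
    for g, p in zip(golds, preds):
        g_tags, p_tags = set(g.split("+")[1:]), set(p.split("+")[1:])
        tp += len(g_tags & p_tags)
        fp += len(p_tags - g_tags)
        fn += len(g_tags - p_tags)
    return tp / (tp + fp), tp / (tp + fn)
```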

                                        Accuracy     Precision    Recall   POS Acc    Lemma Acc
  Baseline                                  0.724        0.783     0.784      0.855        0.840
  Phonological Norm.                        0.777        0.839     0.838      0.886        0.940
  Contextual Embedding                     0.7933        0.906     0.899      0.947       0.9405
  Phonological Norm.+Context. Emb.        0.8501       0.9337    0.9323      0.9287      0.9703
Table 5
TWT Results


   Our results show that phonological normalization alone contributes a 5% increase in tag
accuracy and a 10% gain in lemma accuracy over the baseline. We also observe that the contextual
BERT embeddings yield a 7% gain in tag accuracy, whereas their impact on lemma accuracy
is equivalent to that of phonological normalization. Our best results come from combining
phonological normalization with contextual word embeddings, which improves all the metrics:
we observe gains of around 13% over the baseline in both tag accuracy and lemma accuracy,
with tag accuracy reaching 85% and lemma accuracy reaching 97%. The only metric on which
the combined model does not outperform the alternatives is POS tag accuracy, where the best
result is achieved with contextual embeddings alone, without phonological normalization.

5.3. Multilingual Results
To test the multilingual capabilities of TransMorpher, we evaluated it on Danish, Swedish, and
Russian without the phonological normalization module but with the contextual encoder. Table
6 presents the results. The success rates on the multilingual data are significantly lower than the
Turkish scores. There are two main factors behind this difference. First, there is no phonological
normalization module for these languages. Second, the multilingual BERT model used to encode
context in these languages is not performing at the same level as the language-specific BERT
model we used for Turkish. We believe that the accuracy will improve significantly once we
use language-specific language models for each language. We leave this for future research.

                                  Accuracy    Precision   Recall   Lemma Acc
                       Danish        0.749        0.836    0.834     0.826
                       Swedish       0.649        0.719    0.719     0.703
                       Russian       0.615        0.525    0.527     0.840
Table 6
Experiment Results




6. Conclusion
We presented TransMorpher, a phonologically informed Transformer-based morphological ana-
lyzer. TransMorpher consists of a linguistically motivated phonological normalization module
and a Transformer-based encoder-decoder architecture with a BERT-based context encoder.
We evaluated TransMorpher on Turkish, an agglutinative language with rich phonological
variation, in a low-resource setting. TransMorpher takes as input a target word and the
sentence containing it and returns a disambiguated morphological analysis for the target word
in that sentence. We achieve 85% accuracy on the TWT dataset [20]. Our experiments show
that the phonological normalization component contributes to a significant gain in accuracy.
Although TransMorpher does not reach the state-of-the-art results reported in Akyürek et al. [11],
it achieves promising results with a dataset an order of magnitude smaller than
TrMor2018 [11]. The logical next step is to evaluate TransMorpher on TrMor2018 and
other Turkish datasets to assess its success in high-resource settings and compare it to the
current state-of-the-art models. We leave this for future work. We also evaluated our model
on multilingual data without the rule-based phonological normalization module and achieved
acceptable results. We believe that multilingual results can be improved with better context
embedding models, which we leave for future investigation.


References
 [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,
     Attention is all you need, 2017. arXiv:1706.03762.
 [2] M. Halle, A. Marantz, Distributed morphology and the pieces of inflection, in: K. Hale,
     S. J. Keyser (Eds.), The view from Building 20: Essays in Linguistics in honour of Sylvain
     Bromberger, 1993, pp. 111–176.
 [3] K. Koskenniemi, Two-level model for morphological analysis, in: IJCAI, volume 83, 1983,
     pp. 683–685.
 [4] E. L. Antworth, PC-KIMMO: a two-level processor for morphological analysis, Summer
     Institute of Linguistics (1990).
 [5] K. R. Beesley, L. Karttunen, Finite-state morphology: Xerox tools and techniques, CSLI,
     Stanford (2003).
 [6] K. Oflazer, Two-level description of Turkish morphology, Literary and Linguistic Computing
     9 (1994) 137–148.
 [7] Ç. Çöltekin, A freely available morphological analyzer for Turkish, in: Proceedings of
     the Seventh International Conference on Language Resources and Evaluation (LREC’10),
     European Language Resources Association (ELRA), Valletta, Malta, 2010. URL: http://www.
     lrec-conf.org/proceedings/lrec2010/pdf/109_Paper.pdf.
 [8] H. Sak, T. Güngör, M. Saraçlar, Morphological disambiguation of Turkish text with percep-
     tron algorithm, in: International Conference on Intelligent Text Processing and Computa-
     tional Linguistics, Springer, 2007, pp. 107–118.
 [9] H. Sak, T. Güngör, M. Saraçlar, A stochastic finite-state morphological parser for Turkish,
     in: Proceedings of the ACL-IJCNLP 2009 Conference short papers, 2009, pp. 273–276.
[10] C. Malaviya, M. R. Gormley, G. Neubig, Neural factor graph models for cross-lingual
     morphological tagging, in: Proceedings of the 56th Annual Meeting of the Association for
     Computational Linguistics, volume 1, 2018, pp. 2653–2663.
[11] E. Akyürek, E. Dayanık, D. Yüret, Morphological analysis using a sequence decoder,
     Transactions of the Association for Computational Linguistics 7 (2019) 567–579. URL:
     https://aclanthology.org/Q19-1036. doi:10.1162/tacl_a_00286.
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[13] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks,
     Advances in neural information processing systems 27 (2014).
[14] N. Chomsky, H. Lasnik, Filters and control, Linguistic Inquiry 8 (1977) 425–504. URL:
     http://www.jstor.org/stable/4177996.
[15] N. Chomsky, Lectures on government and binding: The Pisa lectures, 9, Walter de Gruyter,
     1993.
[16] N. Chomsky, The minimalist program, MIT press, 1995.
[17] S. Schweter, BERTurk - BERT models for Turkish, 2020. URL: https://doi.org/10.5281/zenodo.
     3770924. doi:10.5281/zenodo.3770924.
[18] D. Z. Hakkani-Tür, K. Oflazer, G. Tür, Statistical morphological disambiguation for agglu-
     tinative languages, Computers and the Humanities 36 (2002) 381–410.
[19] U. Sulubacak, M. Gokirmak, F. Tyers, Ç. Çöltekin, J. Nivre, G. Eryiğit, Universal Dependen-
     cies for Turkish, in: Proceedings of COLING 2016, the 26th International Conference on
     Computational Linguistics: Technical Papers, The COLING 2016 Organizing Committee,
     Osaka, Japan, 2016, pp. 3444–3454. URL: https://aclanthology.org/C16-1325.
[20] T. Kayadelen, A. Ozturel, B. Bohnet, A gold standard dependency treebank for Turkish,
     in: Proceedings of the 12th Language Resources and Evaluation Conference, European
     Language Resources Association, Marseille, France, 2020, pp. 5156–5163. URL: https:
     //aclanthology.org/2020.lrec-1.634.
[21] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural
     networks, in: Y. W. Teh, M. Titterington (Eds.), Proceedings of the Thirteenth International
Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine
Learning Research, PMLR, Chia Laguna Resort, Sardinia, Italy, 2010, pp. 249–256. URL:
https://proceedings.mlr.press/v9/glorot10a.html.