Time for some German? Pre-Training a Transformer-based Temporal Tagger for German

Satya Almasian¹,⁰, Dennis Aumiller¹,⁰ and Michael Gertz¹
¹ Institute of Computer Science, Heidelberg University, Heidelberg, Germany
⁰ These authors contributed equally to this work

Abstract
Non-English languages are notorious for their lack of available resources, and temporal tagging is no exception. In this work, we explore transfer strategies to improve the quality of a German temporal tagger. From a model perspective, we employ a weakly-supervised pre-training strategy to stabilize the convergence of Transformer-based taggers. In addition, we augment the data with automatically translated English resources, which serve as an alternative to the commonly used alignment of latent embedding spaces. With this, we provide preliminary empirical evidence for the suitability of transfer approaches to other low-resourced languages: a small amount of gold data, coupled with an existing data set in a resource-rich language and a weak labeling baseline system, may be sufficient to boost performance.

Keywords
Temporal tagging, Weakly-supervised learning, German

1. Introduction

Annotated data has become an essential part of modern-day NLP approaches, but non-English resources remain scarce. In the absence of data, it becomes increasingly difficult to even transfer existing approaches to a multilingual context. In this work, we particularly focus on the task of temporal tagging, which serves a multitude of downstream applications in the area of narrative extraction [1]. For example, more accurate temporal tags can be utilized in timeline summarization [2, 3] or event reasoning [4]. For temporal tagging, too, the largest resources exist without a doubt for English [5, 6, 7, 8]. While some non-English resources do exist [9, 10], they are scarce and generally smaller than their English counterparts. Despite attempts to approach the lack of language-specific resources through the lens of multilingual transfer learning [11, 12], Heideltime [13, 14], a rule-based approach extending to multiple languages, remains state-of-the-art. Yet, rule-based approaches generally suffer from precision-heavy tagging, since slight variations of known patterns cannot be detected. By applying state-of-the-art neural models instead, such variations could be covered as well, increasing the overall tagging performance. However, the lack of available data makes the training of data-hungry neural models non-trivial. We illustrate a generic transfer pipeline with German as an example of a lower-resource language. By using a combination of automatically labeled data for pre-training and additional translated English data, we boost the amount of available training data. With this augmented corpus, we are able to fine-tune Transformer models that improve temporal tagging performance for German.

In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'22 Workshop, Stavanger (Norway), 10-April-2022
almasian@informatik.uni-heidelberg.de (S. Almasian); aumiller@informatik.uni-heidelberg.de (D. Aumiller); gertz@informatik.uni-heidelberg.de (M. Gertz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)

2. Related Work

The main reference point for temporal tagging of non-English resources is Heideltime [13, 14], which provides automatically transduced rules for other languages; the coverage varies depending on the language's syntactic structure. At the same time, it also provides language-specific rules for a smaller set of languages, including German. As for datasets, this work relies on the KRAUTS corpus [9], which consists of roughly 1,100 annotations of Tyrolean and German newspaper articles. WikiWarsDE [15] is another German-specific resource; however, its temporal annotations are not available in the current TIMEX3 format, limiting their applicability for recent models.

Approaches dealing with German include Lange et al. [11], who experimented with adversarially aligned embeddings. While their method beats the automatically translated rule set of Heideltime, it falls short of the language-specific rule set. With a similar strategy, Starý et al. [12] fine-tuned a multilingual version of BERT with OntoNotes data. Both works use KRAUTS data for evaluation and have the advantage of automatically scaling to several target languages, albeit at the cost of language-specific performance. Another notable multilingual dataset is TimeBank [16, 17, 18, 19], which covers several languages including French, Italian, Portuguese, and Romanian. Taggers in low-resource settings are generally limited, but do exist: TIPSem [10] and Annotador [20] for Spanish, Bosque-T0 [21] and the work by Costa and Branco [22] for Portuguese, and PET [23] for Persian.

3. A Transfer Pipeline for Temporal Tagging

Temporal tagging is the task of identifying temporal expressions, classifying their type, and sometimes normalizing their values. In this work, we focus on the identification and classification of expressions into the four classes defined by the TIMEX3 schema, namely DATE, TIME, SET, and DURATION.
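To make the tag inventory concrete, the following minimal sketch reads TIMEX3-style annotations from an XML fragment; the German example sentence and its attribute values are our own illustration, not taken from any of the corpora discussed here.

```python
# Minimal sketch: reading TIMEX3-style annotations from an XML fragment.
# The example sentence and all attribute values are invented for illustration.
import xml.etree.ElementTree as ET

fragment = (
    '<s>Der Vertrag läuft <TIMEX3 tid="t1" type="DURATION" value="P2Y">'
    'zwei Jahre</TIMEX3> und endet <TIMEX3 tid="t2" type="DATE" '
    'value="2023-03">im März 2023</TIMEX3>.</s>'
)

for timex in ET.fromstring(fragment).iter("TIMEX3"):
    # Identification: the annotated span; classification: the type attribute.
    # The value attribute carries the (optional) normalization.
    print(f"{timex.text!r} -> {timex.attrib['type']}")
# Output:
# 'zwei Jahre' -> DURATION
# 'im März 2023' -> DATE
```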
As previously mentioned, language-specific resources tend to perform better than multilingual approaches. Therefore, we set out to construct a language-specific German tagging approach with the help of Transformer-based language models [24]. We utilize monolingual language models in this work, as opposed to the previously used multilingual networks. Specifically, Chan et al. [25] present several iterations of German-specific Transformer networks; we choose the best-performing model, which is based on the ELECTRA [26] architecture, namely GELECTRA-large.

However, successfully employing Transformer networks requires more data than is available in the KRAUTS dataset [9]. For this purpose, we create a corpus of automatically tagged news articles, using Heideltime's German tagger. This provides around 500,000 temporal expressions for an additional "pre-training step", exceeding the available German tagging data by roughly 2,000 times, albeit with a lower guarantee of annotation quality.
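To give an impression of how such weak labels feed a token classification model, the following minimal sketch maps character-offset TIMEX3 spans onto token-level BIO tags. Whitespace tokenization stands in for the actual subword tokenizer, and the sentence, offsets, and function name are our own illustration, not part of the paper's codebase.

```python
# Minimal sketch: turn character-span TIMEX3 annotations (e.g., produced by a
# weak labeler such as Heideltime) into token-level BIO tags.
# Whitespace tokenization stands in for the real subword tokenizer.

def spans_to_bio(text, annotations):
    """annotations: list of (start, end, timex_type) character offsets."""
    tokens, labels, pos = [], [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        label = "O"
        for (a_start, a_end, t_type) in annotations:
            # Token fully inside an annotated span gets a B-/I- label.
            if start >= a_start and end <= a_end:
                label = ("B-" if start == a_start else "I-") + t_type
                break
        tokens.append(token)
        labels.append(label)
    return list(zip(tokens, labels))

sentence = "Das Treffen findet am 3. Mai 2021 statt."
weak_annotations = [(22, 33, "DATE")]  # character span of "3. Mai 2021"
for tok, lab in spans_to_bio(sentence, weak_annotations):
    print(f"{tok}\t{lab}")
# "3." -> B-DATE, "Mai" -> I-DATE, "2021" -> I-DATE, all others -> O
```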
Table 1
Statistics of the training resources with TIMEX3 tag distribution. Note that the values for TempEvalDE refer to tags after automated translation. DATE, SET, DURATION, and TIME are the temporal types.

                           #Docs  #Expressions     DATE    SET  DURATION    TIME
HeideltimeDE train        64,299       400,824  292,388  2,502    66,867  39,067
HeideltimeDE test         14,768        97,981   66,713    634    13,892  16,742
TempEvalDE train             256         1,782    1,455     30       251      30
KRAUTS Dolomiten (train)     142           587      376     19        94      98
KRAUTS Die Zeit (test)        50           553      358     39       144      12

We further experiment with automatically translated English data, based on the TempEval-3 corpus [7]. Articles were automatically translated with the help of Google Translate¹, and we were able to retain about 90% of the original annotations in the German version. See Table 1 for a detailed comparison, including the tag distribution.

¹ translate.google.com, accessed: 2022-01-14

4. Experiments

For experimentation, we use the KRAUTS Dolomiten subset as the training set and the Die Zeit subset for testing. All models were run on three NVIDIA A100 GPUs, using the Adam optimizer and linear weight decay. Pre-training was performed for 4 epochs, with a learning rate of 1e-7, a batch size of 16 per GPU, and 4 gradient accumulation steps, which took approximately 30 hours. Variants with automatically translated TempEval data were trained for an additional 8 epochs with a batch size of 16 and a learning rate of 5e-5 on a single GPU, before the final fine-tuning on Dolomiten for another 8 epochs. All metrics for fine-tuned models are averaged over 3 different random seeds; pre-training was run once without pre-determined random seeds.
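For orientation, a configuration along these lines could be expressed with the Hugging Face transformers Trainer as sketched below. The hub model id deepset/gelectra-large, the label inventory, and the dataset handling are our assumptions, since the paper does not describe its implementation; the learning-rate schedule is likewise assumed from the reported "linear weight decay".

```python
# Sketch of the weakly supervised pre-training stage with Hugging Face
# transformers. Model id, label set, and dataset handling are illustrative
# assumptions, not the authors' actual implementation.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# BIO labels over the four TIMEX3 classes, plus "O" for non-temporal tokens.
LABELS = ["O"] + [f"{p}-{t}" for t in ("DATE", "TIME", "DURATION", "SET")
                  for p in ("B", "I")]

tokenizer = AutoTokenizer.from_pretrained("deepset/gelectra-large")
model = AutoModelForTokenClassification.from_pretrained(
    "deepset/gelectra-large", num_labels=len(LABELS))

args = TrainingArguments(
    output_dir="gelectra-temporal-pretraining",
    num_train_epochs=4,              # pre-training epochs from Section 4
    learning_rate=1e-7,              # as reported
    per_device_train_batch_size=16,  # per GPU; the paper uses three A100s
    gradient_accumulation_steps=4,
    lr_scheduler_type="linear",      # assumed linear LR schedule with AdamW
)

# Given a tokenized, BIO-labelled corpus (weak labels from Heideltime),
# training would then run as:
#   trainer = Trainer(model=model, args=args, train_dataset=weak_corpus)
#   trainer.train()
```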
We use the official TempEval-3 script for computing results, which also works with German texts. TempEval generally differentiates between partial ("relaxed") and exact ("strict") tagging overlap.
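To illustrate the distinction, the following sketch re-implements only the span-overlap logic on invented character spans; the official scorer computes more (type scoring, normalization) and handles edge cases such as one prediction overlapping multiple gold spans, which this simplification ignores.

```python
# Illustrative sketch of strict vs. relaxed matching between predicted and
# gold spans, given as (start, end) character offsets. Spans are invented.

def match_counts(gold, pred):
    strict = sum(1 for p in pred if p in gold)          # exact boundary match
    relaxed = sum(1 for (ps, pe) in pred                # any overlap counts
                  if any(ps < ge and gs < pe for (gs, ge) in gold))
    return strict, relaxed

def f1(matches, n_pred, n_gold):
    precision = matches / n_pred if n_pred else 0.0
    recall = matches / n_gold if n_gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [(22, 33), (40, 48)]   # two gold expressions
pred = [(22, 33), (45, 48)]   # second prediction only partially overlaps
strict, relaxed = match_counts(gold, pred)
print("strict F1 :", round(f1(strict, len(pred), len(gold)), 2))   # 0.5
print("relaxed F1:", round(f1(relaxed, len(pred), len(gold)), 2))  # 1.0
```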
4.1. Results

Table 2 contains all available results. Note that the adversarially trained model by Lange et al. [11] transferred from English data and has seen no explicit German training data, which explains its lower performance. The mBERT NER model [12] does not perform type classification. We identify Heideltime as the best-performing baseline system, where its rule-based nature tends to favor precision over recall.

To investigate the effect of continued pre-training, we report results for both off-the-shelf variants and additionally pre-trained models (denoted by "p"). Pre-training was performed on the automatically labeled portion (HeideltimeDE train). "+ temp" denotes fine-tuning on translated TempEval data, and "+ dolo" fine-tuning on Dolomiten data, respectively. For fine-tuning on both sets together, we first train for 8 epochs on TempEval data, and then for another 8 epochs on Dolomiten.

Table 2
Tagging performance on the KRAUTS Die Zeit subset; the best value per column is marked with an asterisk (*). For the mBERT results [12], it is unclear whether the entire KRAUTS dataset was used for evaluation instead. Lange et al. [11] only report F1 scores, which is why their exact precision and recall values are unknown. Our own results are averaged across three fine-tuning runs with varying random seeds.

                                 Strict                   Relaxed            Type
Method                      F-1    Prec.  Recall     F-1    Prec.  Recall     F-1
Heideltime                69.72  77.11*   63.62   79.30  87.71*    72.37   75.38
Adversarial BERT [11]     66.53      ?       ?    77.82      ?        ?    69.04
mBERT NER [12]            43.15  53.92    35.96   64.94  64.94     54.13       -
GELECTRA + dolo           75.51  73.06    78.13   85.88  83.09    88.87*   78.96
GELECTRA + temp + dolo    70.71  70.52    70.91   84.25  84.01     84.49   75.85
GELECTRAp                 65.45  71.10    60.64   77.90  84.62     72.17   73.82
GELECTRAp + dolo         76.13*  73.52   78.93*   85.33  82.41     88.47  80.06*
GELECTRAp + temp + dolo   75.32  74.03    76.68  86.13*  84.65     87.67   79.49

Overall, our best model for relaxed matching (86.13 F1) is GELECTRAp + temp + dolo. However, the automatically translated data appears to be somewhat misleading for strict matches: GELECTRAp + dolo, which is only fine-tuned on Dolomiten, achieves the highest strict match, as well as the best type classification performance. Since the teacher, Heideltime, is precision-focused, all pre-trained variants also exhibit slightly higher precision, implying that the choice of weak labeler for pre-training directly affects the fine-tuning performance as well. Variants without pre-training are, in comparison, more recall-oriented. It is worth noting that even with only pre-training and no fine-tuning, GELECTRAp performs close to Heideltime in terms of F1 scores, which also highlights the cross-domain performance of neural methods. Translations of TempEval data have a deteriorating effect on models without pre-training. A possible explanation is that pre-training makes the model more stable and resilient to noisy inputs, such as automatically translated data. Overall, there is no single top-performing model across all metrics; depending on user preferences, an appropriate model can be chosen.

We also include results for type classification. Note the highly uneven class distribution, which is present in all datasets and makes prediction performance on rare classes challenging. Accessing a larger corpus during pre-training also means encountering rare class instances more frequently, which benefits type prediction in the final evaluation. Correspondingly, pre-trained models outperform their respective counterparts without pre-training. Additional training results with GottBERT [27] and GELECTRA-base were omitted for the sake of brevity, but exhibited worse performance than the presented models.

4.2. Current Limitations

Preliminary results indicate that our fine-tuned models clearly outperform the baseline tagger in almost every metric. However, it should be noted that the performance without pre-training is already quite good and close to that of the pre-trained variants. Given the cost of pre-training, this should be considered a potential trade-off.

Further, we want to point out the high similarity between German and English. This is particularly relevant for the automatically translated resources, as the closeness of the two languages makes it easier to obtain additional high-quality annotations through automated translation.

Finally, the approach still relies on existing resources for the final fine-tuning, including both monolingual models and datasets. However, we suspect that multilingual models would also be suitable after sufficient task-specific pre-training, which makes monolingual models less of a requirement. As for data, the roughly 500 tags used for fine-tuning already seem sufficient to learn a decent system on top of a base model, which is promising for other languages without existing annotations.

5. Conclusion and Future Work

In this work, we have introduced a generic way to fine-tune language-specific temporal taggers, demonstrated with the example of a German tagger. While there are limitations to the current approach, we successfully surpass the current state-of-the-art tagger for German, which is a promising start. For future work, we plan to investigate patterns of incorrect labels to determine areas of improvement, and to employ bootstrapping with semi-supervised learning to further increase the tagging accuracy of precision-heavy model variants.

References

[1] R. Campos, G. Dias, A. M. Jorge, A. Jatowt, Survey of temporal information retrieval and related applications, ACM Comput. Surv. 47 (2014) 15:1–15:41. URL: https://doi.org/10.1145/2619088. doi:10.1145/2619088.
[2] P. Hausner, D. Aumiller, M. Gertz, Time-centric exploration of court documents, in: R. Campos, A. M. Jorge, A. Jatowt, S. Bhatia (Eds.), Proceedings of Text2Story - Third Workshop on Narrative Extraction From Texts co-located with 42nd European Conference on Information Retrieval, Text2Story@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online only], volume 2593 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 31–37. URL: http://ceur-ws.org/Vol-2593/paper4.pdf.
[3] P. Hausner, D. Aumiller, M. Gertz, TiCCo: Time-centric content exploration, in: M. d'Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 3413–3416. doi:10.1145/3340531.3417432.
[4] S. Vashishtha, A. Poliak, Y. K. Lal, B. Van Durme, A. S. White, Temporal reasoning in natural language inference, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 4070–4078. URL: https://aclanthology.org/2020.findings-emnlp.363. doi:10.18653/v1/2020.findings-emnlp.363.
[5] M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, J. Pustejovsky, SemEval-2007 Task 15: TempEval Temporal Relation Identification, in: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 75–80. URL: https://www.aclweb.org/anthology/S07-1014.
[6] M. Verhagen, R. Saurí, T. Caselli, J. Pustejovsky, SemEval-2010 task 13: TempEval-2, in: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 57–62. URL: https://www.aclweb.org/anthology/S10-1010.
[7] N. UzZaman, H. Llorens, L. Derczynski, J. Allen, M. Verhagen, J. Pustejovsky, SemEval-2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations, in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 1–9. URL: https://www.aclweb.org/anthology/S13-2001.
[8] X. Zhong, A. Sun, E. Cambria, Time Expression Analysis and Recognition Using Syntactic Token Types and General Heuristic Rules, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 420–429. URL: https://www.aclweb.org/anthology/P17-1039. doi:10.18653/v1/P17-1039.
[9] J. Strötgen, A. Minard, L. Lange, M. Speranza, B. Magnini, KRAUTS: A German temporally annotated news corpus, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European Language Resources Association (ELRA), 2018.
[10] H. Llorens, E. Saquete, B. Navarro, TIPSem (English and Spanish): Evaluating CRFs and semantic roles in TempEval-2, in: Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July 15-16, 2010, The Association for Computer Linguistics, 2010, pp. 284–291. URL: https://aclanthology.org/S10-1063/.
[11] L. Lange, A. Iurshina, H. Adel, J. Strötgen, Adversarial Alignment of Multilingual Models for Extracting Temporal Expressions from Text, in: Proceedings of the 5th Workshop on Representation Learning for NLP, Association for Computational Linguistics, Online, 2020, pp. 103–109. URL: https://www.aclweb.org/anthology/2020.repl4nlp-1.14. doi:10.18653/v1/2020.repl4nlp-1.14.
[12] M. Starý, Z. Neverilová, J. Valcík, Multilingual recognition of temporal expressions, in: The 14th Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN 2020, Brno (on-line), Czech Republic, December 8-10, 2020, Tribun EU, 2020, pp. 67–78. URL: http://nlp.fi.muni.cz/raslan/2020/paper2.pdf.
[13] J. Strötgen, M. Gertz, HeidelTime: High Quality Rule-Based Extraction and Normalization of Temporal Expressions, in: Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 321–324. URL: https://www.aclweb.org/anthology/S10-1071.
[14] J. Strötgen, M. Gertz, A Baseline Temporal Tagger for all Languages, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 541–547. URL: https://www.aclweb.org/anthology/D15-1063. doi:10.18653/v1/D15-1063.
[15] J. Strötgen, M. Gertz, WikiWarsDE: A German corpus of narratives annotated with temporal expressions, in: Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2011), 2011, pp. 129–134.
[16] A. Bittar, P. Amsili, P. Denis, L. Danlos, French TimeBank: An ISO-TimeML annotated reference corpus, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA - Short Papers, The Association for Computer Linguistics, 2011, pp. 130–134. URL: https://aclanthology.org/P11-2023/.
[17] T. Caselli, V. B. Lenzi, R. Sprugnoli, E. Pianta, I. Prodanof, Annotating events, temporal expressions and relations in Italian: the It-TimeML experience for the Ita-TimeBank, in: Proceedings of the Fifth Linguistic Annotation Workshop, LAW 2011, June 23-24, 2011, Portland, Oregon, USA, The Association for Computer Linguistics, 2011, pp. 143–151. URL: https://aclanthology.org/W11-0418/.
[18] F. Costa, A. Branco, TimeBankPT: A TimeML annotated corpus of Portuguese, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, European Language Resources Association (ELRA), 2012, pp. 3727–3734. URL: http://www.lrec-conf.org/proceedings/lrec2012/summaries/246.html.
[19] C. Forascu, D. Tufiş, Romanian TimeBank: An annotated parallel corpus for temporal information, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, European Language Resources Association (ELRA), 2012, pp. 3762–3766. URL: http://www.lrec-conf.org/proceedings/lrec2012/summaries/770.html.
[20] M. Navas-Loro, V. Rodríguez-Doncel, Annotador: a temporal tagger for Spanish, J. Intell. Fuzzy Syst. 39 (2020) 1979–1991. URL: https://doi.org/10.3233/JIFS-179865. doi:10.3233/JIFS-179865.
[21] L. Real, A. Rademaker, F. Chalub, V. de Paiva, Towards temporal reasoning in Portuguese, in: Proceedings of the LREC 2018 Workshop Linked Data in Linguistics, 2018.
[22] F. Costa, A. Branco, Extracting temporal information from Portuguese texts, in: Computational Processing of the Portuguese Language - 10th International Conference, PROPOR 2012, Coimbra, Portugal, April 17-20, 2012, Proceedings, volume 7243 of Lecture Notes in Computer Science, Springer, 2012, pp. 99–105. URL: https://doi.org/10.1007/978-3-642-28885-2_11. doi:10.1007/978-3-642-28885-2_11.
[23] Y. Yaghoobzadeh, G. Ghassem-Sani, S. A. Mirroshandel, M. Eshaghzadeh, ISO-TimeML event extraction in Persian text, in: COLING 2012, 24th International Conference on Computational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December 2012, Mumbai, India, Indian Institute of Technology Bombay, 2012, pp. 2931–2944. URL: https://aclanthology.org/C12-1179/.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[25] B. Chan, S. Schweter, T. Möller, German's next language model, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6788–6796. URL: https://aclanthology.org/2020.coling-main.598. doi:10.18653/v1/2020.coling-main.598.
[26] K. Clark, M. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training text encoders as discriminators rather than generators, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net, 2020. URL: https://openreview.net/forum?id=r1xMH1BtvB.
[27] R. Scheible, F. Thomczyk, P. Tippmann, V. Jaravine, M. Boeker, GottBERT: a pure German Language Model, arXiv preprint arXiv:2012.02110 (2020).