=Paper=
{{Paper
|id=Vol-3117/paper9
|storemode=property
|title=Time for some German? Pre-Training a Transformer-based Temporal Tagger for German
|pdfUrl=https://ceur-ws.org/Vol-3117/paper9.pdf
|volume=Vol-3117
|authors=Satya Almasian,Dennis Aumiller,Michael Gertz
|dblpUrl=https://dblp.org/rec/conf/ecir/AlmasianA022
}}
==Time for some German? Pre-Training a Transformer-based Temporal Tagger for German==
Satya Almasian1,0, Dennis Aumiller1,0 and Michael Gertz1
1 Institute of Computer Science, Heidelberg University, Heidelberg, Germany
0 These authors contributed equally to this work
Abstract
Non-English languages are notorious for their lack of available resources, and temporal tagging is no
exception. In this work, we explore transfer strategies to improve the quality of a German temporal
tagger. From a model perspective, we employ a weakly-supervised pre-training strategy to stabilize
the convergence of Transformer-based taggers. In addition, we also augment data with automatically
translated English resources, which serve as an alternative to commonly used alignments of latent
embedding spaces. With this, we provide preliminary empirical evidence that indicates the suitability of
transfer approaches to other low-resourced languages: a small amount of gold data, coupled with an
existing data set in a resource-rich language and a weak labeling baseline system, may be sufficient to
boost performance.
Keywords
Temporal tagging, Weakly-supervised learning, German
1. Introduction
Annotated data has become an essential part of modern-day NLP approaches, but non-English
resources remain scarce. In the absence of data, it becomes increasingly difficult to even
transfer existing approaches to a multilingual context. In this work, we particularly focus on
the task of Temporal Tagging, which serves a multitude of downstream applications in the
area of narrative extraction [1]. For example, more accurate temporal tags can be utilized in
timeline summarization [2, 3] or event reasoning [4]. For temporal tagging, too, the largest
resources exist without a doubt for English [5, 6, 7, 8]. While some non-English resources do
exist [9, 10], they are still scarce, and generally smaller than their English counterparts. Despite
attempts to approach the lack of language-specific resources through the lens of multilingual
transfer learning [11, 12], Heideltime [13, 14], a rule-based approach extending to multiple
languages, remains state-of-the-art. Yet, rule-based approaches generally suffer from precision-heavy
tagging, since slight variations of patterns cannot be successfully detected. By applying
state-of-the-art neural models instead, such variations could be covered as well, increasing
the overall tagging performance. However, the lack of available data makes the training of
data-hungry neural models non-trivial.
In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story’22 Workshop, Stavanger (Norway), 10-April-2022
almasian@informatik.uni-heidelberg.de (S. Almasian); aumiller@informatik.uni-heidelberg.de (D. Aumiller); gertz@informatik.uni-heidelberg.de (M. Gertz)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
We illustrate a generic transfer pipeline with German as an example of a lower-resource language.
By using a combination of automatically labeled data for pre-training and additional translated
English data, we boost the amount of available training data. With this augmented corpus, we are
able to fine-tune Transformer models that improve temporal tagging performance for German.
2. Related Work
The main reference point for temporal tagging of non-English resources is Heideltime [13,
14], which provides automatically transduced rules for other languages; the coverage varies
depending on the language’s syntactic structure. At the same time, its authors also provide
language-specific rules for a smaller set of languages, including German.
As for datasets, this work relies on the KRAUTS corpus [9], which consists of roughly 1,100
annotations of Tyrolean and German newspaper articles. WikiwarsDE [15] is another German-specific
resource; however, its temporal annotations are not available in the current TIMEX3 format,
limiting their applicability for recent models.
Approaches dealing with German include Lange et al. [11], who experimented with adversarially
aligned embeddings. While their method beats the automatically translated rule set of
Heideltime, it falls short of the language-specific rule set. With a similar strategy, Starý et al. [12]
fine-tuned a multilingual version of BERT with OntoNotes data. Both works use KRAUTS data
for evaluation and have the advantage of automatically scaling to several target languages,
however at the cost of language-specific performance.
Other notable resources are the various TimeBank corpora [16, 17, 18, 19], which cover several languages
including French, Italian, Portuguese, and Romanian. Taggers in low-resource settings are
generally limited, but do exist: TipSem [10] and Annotador [20] for Spanish, Bosque-T0 [21]
and the work by Costa and Branco [22] for Portuguese, and PET [23] for Persian.
3. A Transfer Pipeline for Temporal Tagging
Temporal tagging is the task of identifying temporal expressions, classifying their type, and
sometimes normalizing their temporal values. In this work, we focus on the identification and
classification of expressions into the four classes defined by the TIMEX3 schema, namely DATE, TIME,
SET, and DURATION. As previously mentioned, language-specific resources tend to perform
better than multilingual approaches. Therefore, we set out to construct a language-specific
German tagging approach with the help of Transformer-based language models [24]. We utilize
monolingual language models in this work, as opposed to the previously utilized multilingual networks.
Specifically, Chan et al. [25] present several iterations of German-specific Transformer networks;
we choose the best-performing model, which is based on the ELECTRA [26] architecture, namely
GELECTRA-large.
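To make this setup concrete, the following minimal sketch (in Python, using the HuggingFace transformers library) instantiates a token-classification model over BIO labels for the four TIMEX3 classes; the model identifier deepset/gelectra-large and the BIO label scheme are assumptions on our side and not details taken from the paper.

    # Minimal sketch, not the authors' released code: a GELECTRA token
    # classifier over BIO labels for the four TIMEX3 classes.
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    TIMEX3_TYPES = ["DATE", "TIME", "DURATION", "SET"]
    labels = ["O"] + [f"{p}-{t}" for t in TIMEX3_TYPES for p in ("B", "I")]
    label2id = {l: i for i, l in enumerate(labels)}
    id2label = {i: l for l, i in label2id.items()}

    # "deepset/gelectra-large" is assumed to be the GELECTRA-large checkpoint.
    tokenizer = AutoTokenizer.from_pretrained("deepset/gelectra-large")
    model = AutoModelForTokenClassification.from_pretrained(
        "deepset/gelectra-large",
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )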
However, successfully employing Transformer networks requires more data than what is
available in the KRAUTS dataset [9]. For this purpose, we create a corpus of automatically tagged
news articles, using Heideltime’s German tagger. This provides around 500,000 temporal
expressions for an additional "pre-training step", exceeding the available German tagging data
by roughly 2,000 times, albeit at a lower guarantee of annotation quality.
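As an illustration of how such weak labels can be consumed, the hypothetical helper below projects character-level TIMEX3 spans (e.g., as produced by Heideltime) onto subword tokens as BIO labels via the tokenizer's offset mapping; the function name and signature are ours, not part of any released tooling.

    # Hypothetical helper: convert character-level TIMEX3 spans from a weak
    # labeler into token-level BIO labels for training.
    def spans_to_bio(text, spans, tokenizer, label2id):
        """spans: list of (char_start, char_end, timex_type) tuples."""
        enc = tokenizer(text, return_offsets_mapping=True, truncation=True)
        token_labels = []
        for tok_start, tok_end in enc["offset_mapping"]:
            if tok_start == tok_end:          # special tokens such as [CLS]/[SEP]
                token_labels.append(-100)     # ignored by the cross-entropy loss
                continue
            tag = "O"
            for start, end, timex_type in spans:
                if tok_start >= start and tok_end <= end:
                    prefix = "B" if tok_start == start else "I"
                    tag = f"{prefix}-{timex_type}"
                    break
            token_labels.append(label2id[tag])
        enc["labels"] = token_labels
        return enc

Applied to the roughly 500,000 Heideltime-tagged expressions, a conversion along these lines yields the weakly labeled pre-training corpus summarized in Table 1.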
Table 1
Statistics of the training resources with TIMEX3 tag distribution. Note that the values for TempEval
refer to tags after automated translation. DATE, SET, DURATION, and TIME are the temporal types.
Dataset #Docs #Expressions DATE SET DURATION TIME
HeideltimeDE train 64,299 400,824 292,388 2,502 66,867 39,067
HeideltimeDE test 14,768 97,981 66,713 634 13,892 16,742
TempEvalDE train 256 1,782 1,455 30 251 30
KRAUTS Dolomiten (train) 142 587 376 19 94 98
KRAUTS Die Zeit (test) 50 553 358 39 144 12
We further experiment with automatically translated English data, based on the TempEval-3
corpus [7]. Articles were automatically translated with the help of Google Translate
(translate.google.com, accessed 2022-01-14), and we were able to retain about 90% of the original
annotations in the German version. See Table 1 for a detailed comparison, including the tag distribution.
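The retention check behind the roughly 90% figure can be approximated as follows; translate() is a placeholder wrapping whichever translation API is used, and only annotations whose translated surface form can be relocated verbatim in the translated sentence are kept. This is a sketch of the idea, not the exact procedure.

    # Sketch of the annotation-retention check after machine translation.
    # `translate` is a placeholder for the translation service.
    def transfer_annotations(sentence_en, timexes_en, translate):
        """timexes_en: list of (surface_string, timex_type) pairs."""
        sentence_de = translate(sentence_en)
        retained = []
        for surface_en, timex_type in timexes_en:
            surface_de = translate(surface_en)
            start = sentence_de.find(surface_de)
            if start != -1:                    # expression found in the translation
                retained.append((start, start + len(surface_de), timex_type))
        return sentence_de, retained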
4. Experiments
For experimentation, we use the KRAUTS Dolomiten subset as the training set, and the Die Zeit
subset for testing. Further, all models were run on three NVIDIA A100 GPUs using the Adam
optimizer and linear weight decay. Pre-training was performed for 4 epochs, with a learning
rate of 1e-7, a batch size of 16 on each GPU, and 4 gradient accumulation steps, which took
approximately 30 hours. Variants with automatically translated TempEval data were trained for an
additional 8 epochs with batch size 16 and a learning rate of 5e-5 on a single GPU before the final
fine-tuning on Dolomiten for another 8 epochs. All metrics on fine-tuned models are averaged
over 3 different random seeds; pre-training was run once without pre-determined random seeds.
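For orientation, the pre-training stage roughly maps onto the following HuggingFace Trainer configuration; the dataset object is a placeholder, and reading "linear weight decay" as a linear learning-rate schedule on top of the default AdamW optimizer is our assumption.

    # Sketch of the pre-training stage on the weakly labeled HeideltimeDE data,
    # assuming the HuggingFace Trainer API; `heideltime_de_train` is a placeholder.
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir="gelectra-temporal-pretrain",
        num_train_epochs=4,
        learning_rate=1e-7,
        per_device_train_batch_size=16,    # per GPU; three A100s in the paper
        gradient_accumulation_steps=4,
        lr_scheduler_type="linear",        # assumed reading of "linear weight decay"
    )
    trainer = Trainer(model=model, args=args, train_dataset=heideltime_de_train)
    trainer.train()

The subsequent fine-tuning stages (8 epochs on translated TempEval data and/or 8 epochs on Dolomiten with a learning rate of 5e-5) would reuse the same setup with adjusted arguments.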
We use the official TempEval-3 script for computing results, which also works with German
texts. TempEval generally differentiates between partial ("relaxed") and exact ("strict") tagging
overlap.
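To make the two matching modes concrete: a strict match requires identical predicted and gold character spans, whereas a relaxed match only requires some overlap. A toy illustration, not the official TempEval-3 scorer:

    # Toy illustration of strict vs. relaxed span matching; the reported
    # numbers come from the official TempEval-3 evaluation script instead.
    def strict_match(pred_span, gold_span):
        return pred_span == gold_span                      # identical offsets

    def relaxed_match(pred_span, gold_span):
        return pred_span[0] < gold_span[1] and gold_span[0] < pred_span[1]

    assert strict_match((5, 12), (5, 12))
    assert relaxed_match((5, 12), (7, 15))                 # partial overlap counts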
4.1. Results
Table 2 contains all available results. Note that the adversarially trained model by Lange et
al. [11] was transferred from English data and has seen no explicit German training data, which
explains its lower performance. The mBERT NER model [12] does not perform type classification.
We identify Heideltime as the best-performing baseline system, where its rule-based nature
tends to favor precision over recall.
To investigate the effect of continued pre-training, we report results for both off-the-shelf
variants and additionally pre-trained models (denoted by "p"). Pre-training was performed
on the automatically labeled portion (HeideltimeDE train). "+ temp" denotes fine-tuning on
translated TempEval data, and "+ dolo" fine-tuning on Dolomiten data, respectively. For fine-tuning
on both sets together, we first train for 8 epochs on TempEval data, and then for another
8 epochs on Dolomiten.
Table 2
Tagging performance on the KRAUTS Die Zeit subset; bold highlights indicate best performance. For
the mBERT results [12], it is unclear whether the entire KRAUTS dataset was used for evaluation instead
of only the Die Zeit subset. Lange et al. [11] only report F1 scores for their results, which is why the
exact precision and recall scores are unknown.
Our own results are averaged across three fine-tuning runs with varying random seeds.
Method Strict-F1 Strict-Prec. Strict-Recall Relaxed-F1 Relaxed-Prec. Relaxed-Recall Type-F1
Heideltime 69.72 77.11 63.62 79.30 87.71 72.37 75.38
Adversarial BERT [11] 66.53 ? ? 77.82 ? ? 69.04
mBERT NER [12] 43.15 53.92 35.96 64.94 64.94 54.13
GELECTRA + dolo 75.51 73.06 78.13 85.88 83.09 88.87 78.96
GELECTRA + temp + dolo 70.71 70.52 70.91 84.25 84.01 84.49 75.85
GELECTRAp 65.45 71.10 60.64 77.90 84.62 72.17 73.82
GELECTRAp + dolo 76.13 73.52 78.93 85.33 82.41 88.47 80.06
GELECTRAp + temp + dolo 75.32 74.03 76.68 86.13 84.65 87.67 79.49
Overall, our best model for relaxed matching (86.13 F1) is GELECTRAp + temp + dolo. However,
it appears that the automatically translated data is somewhat misleading for strict matches;
GELECTRAp + dolo, which is only trained on Dolomiten, has the highest strict match, as well
as the best type classification performance. Since the teacher, Heideltime, is precision-focused,
all pre-trained variants also carry slightly higher precision, implying that the choice of weak
labeler for pre-training directly affects the fine-tuning performance as well. Variants without
pre-training are in comparison more recall-oriented. It is worth noting that even without any
fine-tuning and only pre-training, GELECTRAp manages to perform close to Heideltime in
terms of F1 scores, which also highlights the cross-domain performance of neural methods.
Translations of TempEval data have a deteriorating effect on non-pre-trained models. A possible
explanation is that pre-training makes the model more stable and resilient to noisy inputs, such as
those likely introduced by automatic translation. Overall, it can be observed that there is no single
top-performing model across all metrics. Depending on user preferences, appropriate model
choices can then be made.
We also include results of type classification. Note the highly uneven class distribution, which
is present in all datasets and makes prediction performance for rare classes challenging.
Access to a larger corpus during pre-training also means more frequently encountering rare
class instances, which benefits the type prediction in the final evaluation. Correspondingly,
pre-trained models outperform their respective counterparts without pre-training.
Additional training results with GottBERT [27] and GELECTRA-base were omitted for the sake
of brevity, but exhibited worse performance than the presented models.
4.2. Current Limitations
Preliminary results indicate that our fine-tuned models clearly outperform the baseline
tagger in almost every metric. However, it should be noted that the performance without
pre-training is already quite good and close to the pre-trained variants. Given the cost of
pre-training, this should be considered a potential trade-off.
Further, we want to point out the high similarity between German and English. This is particularly
relevant for the automatically translated resources, since it makes it much easier to obtain additional
high-quality annotations through automated translation.
Finally, the approach still relies on existing resources for the final fine-tuning, which includes
both existing monolingual models and datasets. However, we suspect multilingual models would
also be suitable after sufficient task-specific pre-training, which makes monolingual models
less of a requirement. As for data, the roughly 500 tags used for fine-tuning already seem sufficient to
learn a decent system on top of a base model, which is promising for other languages without
existing annotations.
5. Conclusion and Future Work
In this work, we have introduced a generic way to fine-tune language-specific temporal taggers,
demonstrated using German as an example. While there are limitations to the current
approach, we successfully demonstrate surpassing the current state-of-the-art tagger for German,
which is a promising start.
For future work, we are planning to investigate patterns of incorrect labels to determine areas
of improvement, and to employ bootstrapping with semi-supervised learning to further increase
the tagging accuracy for precision-heavy model variants.
References
[1] R. Campos, G. Dias, A. M. Jorge, A. Jatowt, Survey of temporal information retrieval and
related applications, ACM Comput. Surv. 47 (2014) 15:1–15:41. URL: https://doi.org/10.
1145/2619088. doi:10.1145/2619088.
[2] P. Hausner, D. Aumiller, M. Gertz, Time-centric exploration of court documents, in:
R. Campos, A. M. Jorge, A. Jatowt, S. Bhatia (Eds.), Proceedings of Text2Story - Third
Workshop on Narrative Extraction From Texts co-located with 42nd European Conference
on Information Retrieval, Text2Story@ECIR 2020, Lisbon, Portugal, April 14th, 2020 [online
only], volume 2593 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 31–37. URL:
http://ceur-ws.org/Vol-2593/paper4.pdf.
[3] P. Hausner, D. Aumiller, M. Gertz, Ticco: Time-centric content exploration, in: M. d’Aquin,
S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM ’20: The 29th ACM Interna-
tional Conference on Information and Knowledge Management, Virtual Event, Ireland,
October 19-23, 2020, ACM, 2020, pp. 3413–3416. doi:10.1145/3340531.3417432.
[4] S. Vashishtha, A. Poliak, Y. K. Lal, B. Van Durme, A. S. White, Temporal reasoning in natural
language inference, in: Findings of the Association for Computational Linguistics: EMNLP
2020, Association for Computational Linguistics, Online, 2020, pp. 4070–4078. URL: https://
aclanthology.org/2020.findings-emnlp.363. doi:10.18653/v1/2020.findings-emnlp.363.
[5] M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, J. Pustejovsky, SemEval-
2007 Task 15: TempEval Temporal Relation Identification, in: Proceedings of the
Fourth International Workshop on Semantic Evaluations (SemEval-2007), Association
for Computational Linguistics, Prague, Czech Republic, 2007, pp. 75–80. URL: https:
//www.aclweb.org/anthology/S07-1014.
[6] M. Verhagen, R. Saurí, T. Caselli, J. Pustejovsky, SemEval-2010 task 13: TempEval-2, in:
Proceedings of the 5th International Workshop on Semantic Evaluation, Association for
Computational Linguistics, Uppsala, Sweden, 2010, pp. 57–62. URL: https://www.aclweb.
org/anthology/S10-1010.
[7] N. UzZaman, H. Llorens, L. Derczynski, J. Allen, M. Verhagen, J. Pustejovsky, SemEval-
2013 Task 1: TempEval-3: Evaluating Time Expressions, Events, and Temporal Relations,
in: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2:
Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval
2013), Association for Computational Linguistics, Atlanta, Georgia, USA, 2013, pp. 1–9.
URL: https://www.aclweb.org/anthology/S13-2001.
[8] X. Zhong, A. Sun, E. Cambria, Time Expression Analysis and Recognition Using Syntactic
Token Types and General Heuristic Rules, in: Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics (Volume 1: Long Papers), Association
for Computational Linguistics, Vancouver, Canada, 2017, pp. 420–429. URL: https://www.
aclweb.org/anthology/P17-1039. doi:10.18653/v1/P17-1039.
[9] J. Strötgen, A. Minard, L. Lange, M. Speranza, B. Magnini, KRAUTS: A german temporally
annotated news corpus, in: Proceedings of the Eleventh International Conference on Lan-
guage Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018, European
Language Resources Association (ELRA), 2018.
[10] H. Llorens, E. Saquete, B. Navarro, Tipsem (english and spanish): Evaluating crfs and
semantic roles in tempeval-2, in: Proceedings of the 5th International Workshop on
Semantic Evaluation, SemEval@ACL 2010, Uppsala University, Uppsala, Sweden, July
15-16, 2010, The Association for Computer Linguistics, 2010, pp. 284–291. URL: https:
//aclanthology.org/S10-1063/.
[11] L. Lange, A. Iurshina, H. Adel, J. Strötgen, Adversarial Alignment of Multilingual Models
for Extracting Temporal Expressions from Text, in: Proceedings of the 5th Workshop on
Representation Learning for NLP, Association for Computational Linguistics, Online, 2020,
pp. 103–109. URL: https://www.aclweb.org/anthology/2020.repl4nlp-1.14. doi:10.18653/
v1/2020.repl4nlp-1.14.
[12] M. Starý, Z. Neverilová, J. Valcík, Multilingual recognition of temporal expressions, in: The
14th Workshop on Recent Advances in Slavonic Natural Languages Processing, RASLAN
2020, Brno (on-line), Czech Republic, December 8-10, 2020, Tribun EU, 2020, pp. 67–78.
URL: http://nlp.fi.muni.cz/raslan/2020/paper2.pdf.
[13] J. Strötgen, M. Gertz, HeidelTime: High Quality Rule-Based Extraction and Normalization
of Temporal Expressions, in: Proceedings of the 5th International Workshop on Semantic
Evaluation, Association for Computational Linguistics, Uppsala, Sweden, 2010, pp. 321–324.
URL: https://www.aclweb.org/anthology/S10-1071.
[14] J. Strötgen, M. Gertz, A Baseline Temporal Tagger for all Languages, in: Proceedings of
the 2015 Conference on Empirical Methods in Natural Language Processing, Association
for Computational Linguistics, Lisbon, Portugal, 2015, pp. 541–547. URL: https://www.
aclweb.org/anthology/D15-1063. doi:10.18653/v1/D15-1063.
[15] J. Strötgen, M. Gertz, Wikiwarsde: A german corpus of narratives annotated with temporal
expressions, in: Proceedings of the conference of the German society for computational
linguistics and language technology (GSCL 2011), Citeseer, 2011, pp. 129–134.
[16] A. Bittar, P. Amsili, P. Denis, L. Danlos, French timebank: An iso-timeml annotated
reference corpus, in: The 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June,
2011, Portland, Oregon, USA - Short Papers, The Association for Computer Linguistics,
2011, pp. 130–134. URL: https://aclanthology.org/P11-2023/.
[17] T. Caselli, V. B. Lenzi, R. Sprugnoli, E. Pianta, I. Prodanof, Annotating events, temporal
expressions and relations in italian: the it-timeml experience for the ita-timebank, in:
Proceedings of the Fifth Linguistic Annotation Workshop, LAW 2011, June 23-24, 2011,
Portland, Oregon, USA, The Association for Computer Linguistics, 2011, pp. 143–151. URL:
https://aclanthology.org/W11-0418/.
[18] F. Costa, A. Branco, Timebankpt: A timeml annotated corpus of portuguese, in: Proceedings
of the Eighth International Conference on Language Resources and Evaluation, LREC 2012,
Istanbul, Turkey, May 23-25, 2012, European Language Resources Association (ELRA), 2012,
pp. 3727–3734. URL: http://www.lrec-conf.org/proceedings/lrec2012/summaries/246.html.
[19] C. Forascu, D. Tufis, Romanian timebank: An annotated parallel corpus for temporal infor-
mation, in: Proceedings of the Eighth International Conference on Language Resources and
Evaluation, LREC 2012, Istanbul, Turkey, May 23-25, 2012, European Language Resources
Association (ELRA), 2012, pp. 3762–3766. URL: http://www.lrec-conf.org/proceedings/
lrec2012/summaries/770.html.
[20] M. Navas-Loro, V. Rodríguez-Doncel, Annotador: a temporal tagger for spanish, J. Intell.
Fuzzy Syst. 39 (2020) 1979–1991. URL: https://doi.org/10.3233/JIFS-179865. doi:10.3233/
JIFS-179865.
[21] L. Real, A. Rademaker, F. Chalub, V. de Paiva, Towards temporal reasoning in portuguese,
in: Proceedings of the LREC2018 Workshop Linked Data in Linguistics, 2018.
[22] F. Costa, A. Branco, Extracting temporal information from portuguese texts, in:
Computational Processing of the Portuguese Language - 10th International Conference,
PROPOR 2012, Coimbra, Portugal, April 17-20, 2012. Proceedings, volume 7243 of Lec-
ture Notes in Computer Science, Springer, 2012, pp. 99–105. URL: https://doi.org/10.1007/
978-3-642-28885-2_11. doi:10.1007/978-3-642-28885-2\_11.
[23] Y. Yaghoobzadeh, G. Ghassem-Sani, S. A. Mirroshandel, M. Eshaghzadeh, Iso-timeml event
extraction in persian text, in: COLING 2012, 24th International Conference on Compu-
tational Linguistics, Proceedings of the Conference: Technical Papers, 8-15 December
2012, Mumbai, India, Indian Institute of Technology Bombay, 2012, pp. 2931–2944. URL:
https://aclanthology.org/C12-1179/.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polo-
sukhin, Attention is all you need, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach,
R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Pro-
cessing Systems 30: Annual Conference on Neural Information Processing Systems 2017,
December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. URL: https://proceedings.
neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[25] B. Chan, S. Schweter, T. Möller, German’s next language model, in: Proceedings of the
28th International Conference on Computational Linguistics, International Committee on
Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6788–6796. URL: https:
//aclanthology.org/2020.coling-main.598. doi:10.18653/v1/2020.coling-main.598.
[26] K. Clark, M. Luong, Q. V. Le, C. D. Manning, ELECTRA: pre-training text encoders as
discriminators rather than generators, in: 8th International Conference on Learning
Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, OpenReview.net,
2020. URL: https://openreview.net/forum?id=r1xMH1BtvB.
[27] R. Scheible, F. Thomczyk, P. Tippmann, V. Jaravine, M. Boeker, GottBERT: a pure German
Language Model, arXiv preprint arXiv:2012.02110 (2020).