KLUMSy@KIPoS: Experiments on Part-of-Speech Tagging of Spoken Italian

Thomas Proisl
Computational Corpus Linguistics Group
Friedrich-Alexander-Universität Erlangen-Nürnberg
Bismarckstr. 6, 91054 Erlangen, Germany
thomas.proisl@fau.de

Gabriella Lapesa
Institute for Natural Language Processing
Universität Stuttgart
Pfaffenwaldring 5 b, 70569 Stuttgart, Germany
gabriella.lapesa@ims.uni-stuttgart.de

Abstract

In this paper, we describe experiments on part-of-speech tagging of spoken Italian that we conducted in the context of the EVALITA 2020 KIPoS shared task (Bosco et al., 2020). Our submission to the shared task is based on SoMeWeTa (Proisl, 2018), a tagger which supports domain adaptation and is designed to flexibly incorporate external resources. We document our approach and discuss our results in the shared task, along with a statistical analysis of the factors which impact performance the most. Additionally, we report on a set of additional experiments involving the combination of a neural language model with an unsupervised HMM, and compare its performance to that of our system.

1 Introduction

Part-of-speech taggers trained on standard newspaper texts usually perform relatively poorly on spoken language or on written communication that is “conceptually oral”, e.g. tweets or chat messages. The challenges of spoken language include non-standard lexis, e.g. the use of colloquial and dialectal forms, and non-standard syntax, e.g. false starts, repetitions, incomplete sentences and the use of fillers. To make things worse, the amount of training data available for spoken language – or non-standard varieties in general – is usually several orders of magnitude smaller than for the usual newspaper corpora. One strategy for coping with this is to incorporate additional resources, e.g. lexica or distributional information obtained from large amounts of unannotated text. Another strategy is domain adaptation, i.e. leveraging existing written standard corpora to pre-train an out-of-domain tagger model and then adapting that model to the target domain using a small amount of in-domain data.

We experiment with these ideas in the context of the EVALITA 2020 shared task on part-of-speech tagging of spoken Italian (Bosco et al., 2020; Basile et al., 2020). The data of the shared task have been drawn from the KIParla corpus (Mauri et al., 2019) and consist of manually annotated training and test datasets and a silver dataset that has been automatically tagged by the task organizers using a UDPipe model (http://ufal.mff.cuni.cz/udpipe/1) trained on all Italian treebanks in the Universal Dependencies (UD) project (https://universaldependencies.org/). While the silver dataset is annotated with the standard UD tagset (as are the corpora on which the tagger has been trained), the training and test sets use an extended version where tags can optionally be assigned one of two subcategories: .DIA for dialectal forms and .LIN for foreign words.

2 Additional resources

2.1 Corpora

We use a collection of plain text corpora to compute Brown clusters (Brown et al., 1992) that the tagger can use as an additional resource.

Ideally, we would use large amounts of transcribed speech for the present task. Since no such dataset exists, we try to use corpora that come close. The closest to authentic speech is scripted speech, therefore we use the Italian movie subtitles from the OpenSubtitles corpus (Lison and Tiedemann, 2016; http://opus.nlpl.eu/OpenSubtitles-v2018.php). Computer-mediated communication, e.g. in social media, sometimes exhibits features that are typical of spoken language use. Therefore, we also use a collection of roughly 11.7 million Italian tweets and ca. 2.7 million Reddit posts (submissions and comments) from the years 2011–2018. We extracted the Reddit posts from Jason Baumgartner’s collection of Reddit submissions and comments (https://files.pushshift.io/reddit/) using the processing pipeline by Blombach et al. (2020). Additionally, we include all Italian corpora from the Universal Dependencies project and, to further increase the amount of data, a number of web corpora: the PAISÀ corpus of Italian texts from the web (Lyding et al., 2014; http://www.corpusitaliano.it/), the text of the Italian Wikimedia dumps (https://dumps.wikimedia.org/), i.e. Wiki(pedia|books|news|versity|voyage), as extracted by Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor), and the Italian subset of OSCAR, a huge multilingual Common Crawl corpus (Ortiz Suárez et al., 2019; https://oscar-corpus.com/).

We tokenize and sentence-split all corpora using UDPipe trained on the union of all Italian UD corpora and remove all duplicate sentences. The sizes of the resulting corpora are given in Table 1. As final preprocessing steps, we lowercase all words and normalize numbers, user mentions, email addresses and URLs. Finally, we use the implementation by Liang (2005) (https://github.com/percyliang/brown-cluster/) to compute 1,000 Brown clusters with a minimum frequency of 5 (a sketch of these steps follows Table 1).

corpus          complete        deduplicated
oscar           –               13,787,307,218
opensubtitles   795,250,711     378,348,061
paisa           282,631,297     258,679,965
reddit          112,735,958     105,274,620
tweets          152,496,728     148,031,020
ud              672,929         615,057
wiki            578,425,024     560,863,691
wikibooks       12,106,499      11,825,870
wikinews        2,744,317       2,583,135
wikiversity     5,766,859       5,365,924
wikivoyage      3,911,881       3,825,872

Table 1: Sizes of the additional corpora in tokens. OSCAR is already deduplicated on the line level.
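As an illustration of these final preprocessing steps (lowercasing, normalization and deduplication), a minimal sketch is shown below. It is not our exact pipeline: the regular expressions, placeholder tokens and function names are chosen for the example only, and the clustering itself is performed afterwards with Liang's brown-cluster implementation on the normalized one-sentence-per-line files.

```python
# Minimal sketch of the normalization and deduplication steps; the regular
# expressions and placeholder tokens are illustrative, not the exact ones used.
import re

URL = re.compile(r"https?://\S+|www\.\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
MENTION = re.compile(r"@\w+")
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def normalize(sentence: str) -> str:
    """Lowercase a tokenized sentence and replace variable token types."""
    sentence = sentence.lower()
    sentence = URL.sub("<url>", sentence)
    sentence = EMAIL.sub("<email>", sentence)
    sentence = MENTION.sub("<mention>", sentence)
    sentence = NUMBER.sub("<number>", sentence)
    return sentence

def deduplicate(sentences):
    """Drop duplicate sentences while keeping the original order."""
    seen = set()
    for sentence in sentences:
        if sentence not in seen:
            seen.add(sentence)
            yield sentence

if __name__ == "__main__":
    sample = ["Scrivi a mario.rossi@example.com entro il 31.12.2020 !",
              "Scrivi a mario.rossi@example.com entro il 31.12.2020 !"]
    for s in deduplicate(map(normalize, sample)):
        print(s)   # scrivi a <email> entro il <number> !
```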
2.2 Morphological lexicon

We incorporate linguistic knowledge in the form of Morph-it! (Zanchetta and Baroni, 2005; https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it), a morphological lexicon for Italian that contains morphological analyses of roughly 505,000 word forms corresponding to about 35,000 lemmata. In its analyses, Morph-it! distinguishes between derivational features and inflectional features; in total, there are 664 unique feature combinations. We simplify the analyses by stripping away all inflectional features and some of the derivational features, i.e. gender (for articles, nouns and pronouns) and person and number (for pronouns). This results in 39 coarse-grained categories that correspond to major word classes, with some finer distinctions for determiners and pronouns.
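To illustrate the kind of mapping we apply, the sketch below assumes the Morph-it! convention that inflectional features follow a colon and that derivational features are hyphen-separated (e.g. NOUN-M:s). The feature inventories, file handling and function names are simplified for the example and do not reproduce the exact mapping that yields the 39 categories.

```python
# Rough sketch of the tag simplification; the feature inventories below are
# illustrative and smaller than the real Morph-it! feature set.
GENDER = {"M", "F"}
PERSON_NUMBER = {"1", "2", "3", "S", "P"}

def simplify(analysis: str) -> str:
    """Map a full Morph-it! analysis to a coarse-grained category."""
    derivational = analysis.split(":", 1)[0]          # strip inflectional features
    features = derivational.split("-")
    pos = features[0]
    if pos in {"ART", "NOUN"}:                        # drop gender
        features = [f for f in features if f not in GENDER]
    elif pos.startswith("PRO"):                       # drop gender, person and number
        features = [f for f in features if f not in GENDER | PERSON_NUMBER]
    return "-".join(features)

def load_lexicon(path: str) -> dict:
    """Read the tab-separated Morph-it! file (word form, lemma, analysis)."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:           # older releases may use Latin-1
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:
                form, _lemma, analysis = parts
                lexicon.setdefault(form, set()).add(simplify(analysis))
    return lexicon

print(simplify("NOUN-M:s"))              # NOUN
print(simplify("PRO-PERS-CLI-1-M-S"))    # PRO-PERS-CLI
```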
3 System description

For our submission to the shared task we use SoMeWeTa (Proisl, 2018; https://github.com/tsproisl/SoMeWeTa), a tagger that is based on the averaged structured perceptron, supports domain adaptation and can incorporate external resources such as Brown clusters and lexica. Its ability to make use of existing linguistic resources allows the tagger to achieve competitive results even with relatively small amounts of in-domain training data, which is particularly useful for non-standard varieties or under-resourced languages (Kabashi and Proisl, 2018; Proisl et al., 2019).

We participate in all three subtasks: the main subtask, where we use all the available silver and training data; subtask A, where we only use the data from the formal register; and subtask B, where we only use the informal data. The training scheme is the same for all three subtasks. First, we train preliminary models on the silver data provided by the task organizers. Keep in mind that the silver dataset has been automatically tagged; therefore, it is annotated with the standard version of the UD tagset rather than with the extended one used in the shared task, and it contains a certain amount of tagging errors. Nevertheless, the dataset provides the tagger with (imperfect) domain-specific background knowledge. In the next step, we adapt the silver models to the union of the Italian UD treebanks, i.e. to high-quality but out-of-domain data. In the final step, we adapt the models to spoken Italian using the manually annotated training data. In every step we train for 12 iterations using a search beam size of 10 and provide the tagger with the Brown clusters and the Morph-it!-based lexicon (Section 2).
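The three training stages can be scripted, for example, by driving SoMeWeTa's command-line interface from Python as sketched below. The file names are placeholders, and the option names (--train, --prior, --brown, --lexicon, --iterations, --beam-size) reflect our reading of the SoMeWeTa documentation rather than a verified invocation, so they should be checked against `somewe-tagger --help`.

```python
# Sketch of the three-stage training scheme; file names are placeholders and
# the option names should be double-checked against the installed SoMeWeTa version.
import subprocess

RESOURCES = ["--brown", "brown_clusters.txt", "--lexicon", "morphit_coarse.txt",
             "--iterations", "12", "--beam-size", "10"]

def train(model: str, corpus: str, prior: str = None) -> None:
    """Train a SoMeWeTa model, optionally adapting an existing (prior) model."""
    cmd = ["somewe-tagger", "--train", model] + RESOURCES
    if prior is not None:
        cmd += ["--prior", prior]
    subprocess.run(cmd + [corpus], check=True)

# Stage 1: preliminary model on the automatically tagged silver data
train("silver.model", "kipos_silver.txt")
# Stage 2: adapt to the union of the Italian UD treebanks (high quality, out of domain)
train("ud.model", "it_ud_all.txt", prior="silver.model")
# Stage 3: adapt to spoken Italian using the manually annotated KIPoS training data
train("kipos.model", "kipos_train.txt", prior="ud.model")
```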
4 Evaluation

4.1 Data preparation and evaluation results

The silver data, the training data and the data from the UD treebanks follow UD tokenization guidelines, i.e. contractions such as parlarmi (parlar+mi) ‘to talk+to me’ or della (di+la) ‘of+the’ are split into their constituents for annotation. This is not the case for the test data, where contractions have to be assigned a joint tag, e.g. VERB_PRON or ADP_A. Therefore, we run the test data through the UDPipe tokenizer from Section 2.1, tag the resulting tokens and merge the tags for all tokens that have been split (a sketch of this merging step follows Table 2).

Table 2 shows the results on the two test sets. (Unfortunately, when preparing our submission, we did not notice that contractions of prepositions (ADP) and determiners (DET) have to be tagged as ADP_A. As a consequence, we mis-tagged all these contractions as ADP_DET. For reference, the results of our faulty submission on the formal/informal test sets are: main 87.56/88.24, subA 87.37/87.58, subB 87.81/88.11.) On the main task, SoMeWeTa performs reasonably well, only 1–1.4 points worse than the fine-tuned UmBERTo model by Tamburini (2020). On subtasks A and B, it even outperforms that system by a considerable margin.

task   system             formal   informal
main   corrected          92.12    90.11
       gold tokens        92.31    90.66
       Tamburini (2020)   93.49    91.13
subA   corrected          91.92    89.45
       gold tokens        92.12    89.97
       Tamburini (2020)   86.47    83.16
subB   corrected          92.37    89.97
       gold tokens        92.54    90.53
       Tamburini (2020)   89.74    89.52

Table 2: Accuracy scores for our submissions in two variants: (i) with ADP_DET corrected to ADP_A and (ii) based on the true token boundaries instead of on UDPipe tokens.
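The merging step described above is straightforward if the tagged test data is available in CoNLL-U format, where contractions appear as multiword token ranges followed by their parts; the sketch below makes that assumption. Function names and the example are illustrative, and the final mapping of ADP_DET to ADP_A corresponds to the correction mentioned in the parenthetical note above.

```python
# Sketch of merging the tags of split contractions from tagged CoNLL-U data.
def merge_tags(conllu_lines):
    """Yield (surface form, merged tag) pairs, joining the tags of split contractions."""
    rows = [line.rstrip("\n").split("\t") for line in conllu_lines
            if line.strip() and not line.startswith("#")]
    upos = {row[0]: row[3] for row in rows if "-" not in row[0]}
    covered = 0
    for row in rows:
        token_id, form = row[0], row[1]
        if "-" in token_id:                              # multiword token line, e.g. "2-3  della"
            start, end = (int(i) for i in token_id.split("-"))
            tag = "_".join(upos[str(i)] for i in range(start, end + 1))
            # the shared task expects ADP_A for preposition+article contractions
            yield form, tag.replace("ADP_DET", "ADP_A")
            covered = end
        elif token_id.isdigit() and int(token_id) > covered:   # ordinary single-word token
            yield form, upos[token_id]

example = [
    "1\tparlo\tparlare\tVERB\t_\t_\t_\t_\t_\t_",
    "2-3\tdella\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdi\tdi\tADP\t_\t_\t_\t_\t_\t_",
    "3\tla\til\tDET\t_\t_\t_\t_\t_\t_",
]
print(list(merge_tags(example)))   # [('parlo', 'VERB'), ('della', 'ADP_A')]
```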
4.2 Mining tagging accuracy

To get a better insight into the impact of the different experimental variables involved in this study, we carried out feature ablation experiments which targeted the different components of our system, namely the different combinations of training and test data (formal vs. informal) and the different additional resources described in Section 2 (use of Brown clusters, Morph-it!, silver data, and UD corpora). We then carried out a linear regression analysis with tagging accuracy as the dependent variable and the different experimental parameters as independent variables (predictors). We follow the methodology outlined in Lapesa and Evert (2014) and quantify the impact of a specific predictor (e.g. the use of Brown clusters) as the amount of variance in the dependent variable (tagging accuracy) it accounts for. We considered the following experimental parameters as predictors:

• setup: training/test setup; this predictor encodes the combination of training and test data and has the following values: all_formal (i.e. trained on the full set, tested on formal), all_informal, formal_formal, formal_informal, informal_formal, informal_informal
• silver: use of silver data during training (yes, no)
• ud: use of UD corpora during training (yes, no)
• morph: use of Morph-it! (yes, no)
• brown: use of Brown clusters (yes, no)

We tested all possible configurations, i.e. all combinations of the parameters described above, and, to account for random effects during training, ran each configuration 10 times. This resulted in 960 experimental runs, each corresponding to a single datapoint in our regression analysis. Given that it is reasonable to assume that specific parameter values will influence the performance of other parameters (e.g., the use of Morph-it! could boost performance but only if larger corpora are employed), we also test all the 2-way interactions. As a sanity check, we also introduce the number of an experimental run as a predictor (1 to 10, as a categorical variable), in the hope, obviously, of finding no effect for it. Summing up, our regression equation looks as follows:

    accuracy ~ (setup + silver + ud + morph + brown + run)^2

We ran the regression analysis in R; the equation follows R syntax, in which “^2” denotes all pairwise interactions of the predictors between parentheses.
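As noted, the analysis itself was run in R. An analogous fit in Python with statsmodels and patsy is sketched below; the data frame, file and column names are assumptions made for the illustration.

```python
# Sketch of the regression fit; in patsy, (a + b + ...)**2 expands to all main
# effects plus all 2-way interactions, mirroring the R formula above.
import pandas as pd
import statsmodels.formula.api as smf

runs = pd.read_csv("experimental_runs.csv")   # 960 rows: accuracy plus parameter settings
runs["run"] = runs["run"].astype(str)         # treat the run number as categorical

model = smf.ols(
    "accuracy ~ (setup + silver + ud + morph + brown + run)**2",
    data=runs,
).fit()
print(model.rsquared_adj)                     # adjusted R-squared (95.2% in our analysis)
print(model.summary())
```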
Unsurprisingly, our model achieves an excellent fit to the data, quantified in an adjusted R-squared of 95.2%. Table 3 lists all significant predictors and interactions, along with their explained variance. Explained variance quantifies the portion of the total R-squared that a specific parameter (or interaction) is responsible for and can be straightforwardly interpreted as the impact that the manipulation of a specific parameter has on the accuracy of our tagger. Reassuringly, we found no effect of the experimental run. All other predictors, and all the corresponding interactions, turned out to be highly significant (with one minor exception). The biggest role is played by the setup variable, which alone accounts for 42.06%. Using UD corpora in training also has a strong impact, with a strong interaction involving the use of silver data (6.00% R-squared). Further strong interactions are found between brown and morph, and brown and ud, probably suggesting that introducing a 3-way interaction would be appropriate here. Given the increased complexity, however, this extension is left for future work.

Predictor       Explained variance
setup           42.06 ***
silver           8.62 ***
ud              12.63 ***
brown            8.76 ***
morph            7.17 ***
setup:silver     1.21 ***
setup:ud         1.08 ***
setup:brown      0.42 ***
setup:morph      0.50 ***
silver:ud        6.00 ***
silver:brown     0.39 ***
silver:morph     1.98 ***
ud:brown         0.03 *
ud:morph         2.48 ***
brown:morph      2.44 ***

Table 3: Regression on tagging accuracy: predictors and explained variance. Adj. R-squared: 95.2%. Sign. thresholds: ***: 0.001; *: 0.05.

Now that we have established which parameters or interactions have the strongest impact on model performance, it is time to ask which parameter values ensure the best performance. In our case, given that the system can be assembled incrementally (adding external resources and training data to a basic configuration), asking what the best parameter values are amounts to determining whether, for example, the addition of Brown clusters improves performance or is detrimental. Note that the significance of the brown predictor in the regression analysis already tells us that the predictor affects performance, ruling out the possibility that it has no impact at all. To visualize the effects in the linear model, we follow Lapesa and Evert (2014) and employ effect displays, which show the partial effect of one or two parameters by marginalizing over all other parameters. Unlike coefficient estimates, they allow an intuitive interpretation of the effect sizes of categorical variables irrespective of the dummy coding scheme used.

Let us start with the strongest predictor, setup, in its strongest interaction, the one with silver. Figure 1 displays the predicted accuracies resulting from the different parameter combinations of the two predictors. Note that, given the excellent fit of the regression model, we can assume predicted accuracy to be a reliable estimate of actual accuracy. Also, note that while we are visualizing the predicted accuracy of a 2-way interaction, we are actually displaying the effect of the individual terms (setup and silver) and of the interaction (setup:silver) jointly. We observe that, unsurprisingly, independently of the use of silver data, training on the whole dataset ensures the best performance on both the formal and informal test sets. The use of silver data (pink line) improves performance, but with differences across the training/test setups. Interestingly, using the silver data makes the performance gap between the models trained on the whole dataset and those trained on just the informal dataset negligible. Surprisingly, we observe that the best performance on the formal test set is predicted when the informal set is used for training. Further experiments on the complementarity of the two subtasks are needed to clarify this contradiction.

[Figure 1: Interaction: setup and silver data. Effect display of predicted accuracy for the six training/test setups, with and without silver data.]

Figure 2 displays the interaction between the use of UD corpora and the integration of Morph-it! in SoMeWeTa. Note that the performance gaps are smaller here than in the previous interaction: this is no surprise, given the smaller explanatory power (explained variance) of the parameters and interactions involved. Morph-it! produces substantial improvements, but again to a lesser extent if UD corpora are employed: this could either be due to a lower coverage of Morph-it! on the UD corpora, or to the boost in model robustness produced by the introduction of a larger training set. The steep slope of the blue line with respect to the pink one suggests that the presence of a morphological lexicon like Morph-it! can compensate for the lack of training data. Let us conclude with the third strongest interaction, the one between the use of Brown clusters and the use of Morph-it!, not shown here for space constraints. It is strikingly similar to the one in Figure 2: Morph-it! improves performance overall, and the steeper improvement in the absence of the Brown clusters suggests that the quality of the information encoded in Morph-it! can compensate for the lack of external resources.

[Figure 2: Interaction: UD corpora and Morph-it!. Effect display of predicted accuracy with and without UD corpora, with and without Morph-it!.]

In sum, our analysis supports the starting assumption that in a low-resource setting like the one of KIPoS, integrating additional, focussed resources consistently improves performance.

5 Additional experiments: RoBERTa with unsupervised HMM

Fine-tuned neural language models have been extremely successful in all areas of natural language processing (NLP). Not only can language models trained on huge amounts of plain text be fine-tuned to all NLP tasks, they have also been shown to learn certain linguistic abstractions (Tenney et al., 2019). At least that seems to be the case for English. Languages that are typologically different from English are both more difficult to model with current architectures (Mielke et al., 2019) and seem to be more challenging when it comes to learning linguistic abstractions (Ravfogel et al., 2018). In the experiment described in this section, we extend a state-of-the-art language model architecture to explicitly model part-of-speech information. To this end, we combine a RoBERTa language model (Liu et al., 2019) with an unsupervised neural hidden Markov model (HMM) for part-of-speech induction.

The architecture of the unsupervised HMM follows the LSTM-based variant described by Tran et al. (2016). We directly use the negative logarithm of the observation likelihood determined by the backward algorithm as an additional loss for the language model. The embeddings of the best tag sequence (determined using the Viterbi algorithm) are added to the word embeddings before feeding them into the language model.
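A heavily simplified sketch of this combination is shown below. It is not our actual implementation: Tran et al. (2016) parameterize transitions and emissions with neural networks (including an LSTM variant), whereas the sketch uses plain parameter matrices, computes the sequence likelihood with the forward recursion (which yields the same marginal as the backward algorithm), and stands in a dummy loss for the real RoBERTa masked-language-modelling loss. All names and sizes are illustrative.

```python
# Simplified sketch of an unsupervised neural HMM combined with a language model.
import torch
import torch.nn as nn

class NeuralHMM(nn.Module):
    """Unsupervised HMM over word ids with trainable start, transition and emission scores."""

    def __init__(self, num_tags: int, vocab_size: int):
        super().__init__()
        self.start = nn.Parameter(torch.zeros(num_tags))
        self.trans = nn.Parameter(torch.zeros(num_tags, num_tags))
        self.emit = nn.Parameter(torch.zeros(num_tags, vocab_size))

    def _log_probs(self, words):
        start = self.start.log_softmax(-1)               # (K,)
        trans = self.trans.log_softmax(-1)               # (K, K), row = previous tag
        emit = self.emit.log_softmax(-1)[:, words].t()   # (T, K)
        return start, trans, emit

    def neg_log_likelihood(self, words):
        """Negative log marginal likelihood of the observed words (forward recursion)."""
        start, trans, emit = self._log_probs(words)
        alpha = start + emit[0]
        for t in range(1, emit.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + trans, dim=0) + emit[t]
        return -torch.logsumexp(alpha, dim=0)

    def viterbi(self, words):
        """Most probable tag sequence for the observed words."""
        start, trans, emit = self._log_probs(words)
        score, backptrs = start + emit[0], []
        for t in range(1, emit.size(0)):
            best, idx = (score.unsqueeze(1) + trans).max(dim=0)
            score = best + emit[t]
            backptrs.append(idx)
        tags = [int(score.argmax())]
        for idx in reversed(backptrs):
            tags.append(int(idx[tags[-1]]))
        return list(reversed(tags))


# Toy usage: the HMM likelihood becomes an additional loss term and the embeddings
# of the Viterbi tags are added to the word embeddings fed into the language model.
vocab_size, num_tags, hidden = 1000, 17, 64
hmm = NeuralHMM(num_tags, vocab_size)
word_emb = nn.Embedding(vocab_size, hidden)
tag_emb = nn.Embedding(num_tags, hidden)

words = torch.randint(0, vocab_size, (12,))       # one token sequence
hmm_loss = hmm.neg_log_likelihood(words)          # additional loss for the LM
tags = torch.tensor(hmm.viterbi(words))
inputs = word_emb(words) + tag_emb(tags)          # would be fed into RoBERTa
lm_loss = inputs.pow(2).mean()                    # stand-in for the real MLM loss
(lm_loss + hmm_loss).backward()
```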
Due to time and resource constraints, we opt for a small to medium-sized model with a total of 45.5 million trainable parameters and train it on 1.9 billion tokens of text (the corpora described in Section 2.1, excluding OSCAR). We use the RoBERTa implementation from the transformers library (https://github.com/huggingface/transformers) with 6 hidden layers, 8 attention heads, a hidden size of 512 and an intermediate size of 2048. The model variant with the unsupervised HMM totals 48.7 million trainable parameters. We pre-train and fine-tune both models with the same set of parameters: pre-training for 100,000 steps with a batch size of 500, a peak learning rate of 5 × 10⁻⁴, 6,000 warm-up steps and dropout set to 0.1; fine-tuning to the KIPoS task on the entire training data for 4 epochs with a batch size of 32 and a learning rate of 3 × 10⁻⁴.

The results are summarized in Table 4. Due to the small model size and the relatively small amount of training data, the performance of both models is below SoMeWeTa’s. Keep in mind that state-of-the-art language models for Italian like UmBERTo (https://github.com/musixmatchresearch/umberto) or GilBERTo (https://github.com/idb-ita/GilBERTo) are based on the same RoBERTa architecture but feature roughly three times as many parameters and have been trained on an order of magnitude more data. However, the experiment is successful insofar as explicitly modelling part-of-speech information using an unsupervised HMM gives modest gains on both test sets. On the union of the two test sets, this corresponds to a statistically significant improvement from 89.84 to 90.42 (McNemar mid-p test: p = 0.0133; a sketch of the test follows Table 4).

model         formal   informal
RoBERTa       91.28    88.46
RoBERTa+HMM   91.84    89.05

Table 4: Results for RoBERTa and for RoBERTa with an additional unsupervised HMM.
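The significance test can be computed directly from the paired per-token predictions. The sketch below assumes aligned lists of gold tags and the two systems' predictions over the union of the test sets, and uses the common mid-p definition (exact two-sided binomial p-value minus the point probability of the observed discordant count); the function name is illustrative.

```python
# Sketch of a McNemar mid-p test on paired per-token correctness.
from scipy.stats import binom

def mcnemar_midp(gold, pred_a, pred_b) -> float:
    """Two-sided mid-p McNemar test comparing two taggers on the same tokens."""
    b = sum(a == g != p for g, a, p in zip(gold, pred_a, pred_b))  # only system A correct
    c = sum(p == g != a for g, a, p in zip(gold, pred_a, pred_b))  # only system B correct
    n, k = b + c, min(b, c)
    midp = 2 * binom.cdf(k, n, 0.5) - binom.pmf(k, n, 0.5)
    return min(float(midp), 1.0)
```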
6 Conclusion

This paper started out with the assumption that in low-resource scenarios like the KIPoS shared task, the integration of additional resources such as lexica (in our case, Morph-it!) and distributional information from larger corpora (in our case, the Brown clusters) can compensate for the lack of large amounts of training data. Moreover, our strategy also built on the assumption that in a low-resource scenario domain adaptation would be a winning strategy, as it would enable us to exploit larger training sets for written language (out of domain) and then fine-tune the tagger on spoken language (in domain). The results of our experiments and the insights gathered from the statistical analysis indicate that both assumptions hold, as far as our contribution to the KIPoS shared task is concerned. In subtasks A and B, where only half the amount of training data was available, this strategy even outperformed a fine-tuned state-of-the-art neural language model. Further work is needed to assess the complementarity of the error profiles of the different configurations, also taking into account the neural architectures evaluated in Section 5.

References

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proc. of EVALITA, Online. CEUR.org.

Andreas Blombach, Natalie Dykes, Philipp Heinrich, Besim Kabashi, and Thomas Proisl. 2020. A corpus of German Reddit exchanges (GeRedE). In Proc. of LREC, pages 6310–6316, Marseille. ELRA.

Cristina Bosco, Silvia Ballarè, Massimo Cerruti, Eugenio Goria, and Caterina Mauri. 2020. KIPoS@EVALITA2020: Overview of the task on KIParla part of speech tagging. In Proc. of EVALITA. CEUR.org.

Peter F. Brown, Vincent J. Della Pietra, Peter V. de Souza, Jennifer C. Lai, and Robert L. Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.

Besim Kabashi and Thomas Proisl. 2018. Albanian part-of-speech tagging: Gold standard and evaluation. In Proc. of LREC, pages 2593–2599, Miyazaki. ELRA.

Gabriella Lapesa and Stefan Evert. 2014. A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. TACL, 2:531–546.

Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science.

Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proc. of LREC, pages 923–929, Portorož. ELRA.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach.

Verena Lyding, Egon Stemle, Claudia Borghetti, Marco Brunello, Sara Castagnoli, Felice Dell’Orletta, Henrik Dittmann, Alessandro Lenci, and Vito Pirrelli. 2014. The PAISÀ corpus of Italian web texts. In Proc. of WaC-9, pages 36–43, Gothenburg. ACL.

Caterina Mauri, Silvia Ballarè, Eugenio Goria, Massimo Cerruti, and Francesco Suriano. 2019. KIParla corpus: A new resource for spoken Italian. In Proc. of CLiC-it, Bari.

Sabrina J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, and Jason Eisner. 2019. What kind of language is hard to language-model? In Proc. of ACL, pages 4975–4989, Florence. ACL.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proc. of CMLC-7, pages 9–16, Mannheim. Leibniz-Institut für Deutsche Sprache.

Thomas Proisl, Peter Uhrig, Philipp Heinrich, Andreas Blombach, Sefora Mammerella, Natalie Dykes, and Besim Kabashi. 2019. The_Illiterati: Part-of-speech tagging for Magahi and Bhojpuri without even knowing the alphabet. In Proc. of NSURL, Trento.

Thomas Proisl. 2018. SoMeWeTa: A part-of-speech tagger for German social media and web texts. In Proc. of LREC, pages 665–670, Miyazaki. ELRA.

Shauli Ravfogel, Yoav Goldberg, and Francis Tyers. 2018. Can LSTM learn to capture agreement? The case of Basque. In Proc. of BlackboxNLP, pages 98–107, Brussels. ACL.

Fabio Tamburini. 2020. UniBO@KIPoS: Fine-tuning the Italian “BERTology” for the EVALITA 2020 KIPOS task. In Proc. of EVALITA. CEUR.org.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proc. of ACL, pages 4593–4601, Florence. ACL.

Ke M. Tran, Yonatan Bisk, Ashish Vaswani, Daniel Marcu, and Kevin Knight. 2016. Unsupervised neural hidden Markov models. In Proc. of the Workshop on Structured Prediction for NLP, pages 63–71, Austin, TX. ACL.

Eros Zanchetta and Marco Baroni. 2005. Morph-it! A free corpus-based morphological resource for the Italian language. In Proc. of Corpus Linguistics, Birmingham.