Combining Different Parsers and Datasets for CAPITEL UD Parsing

Fernando Sánchez-León
Independent Academic

Abstract
This paper describes our experiments on Universal Dependency parsing of a subset of the capitel news article corpus, prepared for a competition within the IberLEF 2020 Evaluation Forum. Several data-driven systems, using different technologies, are applied to the task. Besides the training dataset provided by the organizers, we augment the training set with the training partition of another widely used UD-parsed Spanish corpus, AnCora. On top of this combination of toolkits and corpora, a voting strategy is proposed, in which the predictions of the best scoring systems are combined to boost final performance. This combined model ranked first in the above-mentioned competition.

Keywords
Universal Dependencies, parsing, Spanish, news article corpus, capitel, AnCora, data augmentation, parser output combination

1. Introduction
Since the first attempt to build a Universal Dependency Treebank [1], interest in refining the original proposal, defining a universal part-of-speech tagset and, most prominently, producing corpora for new languages (and/or improving the annotation of existing ones) and implementing multilingual dependency parsers has grown over time. Universal Dependencies (UD), as a project, now has 163 treebanks covering 92 languages (as of May 15, 2020).

The UD family has grown with a new offspring, since the Spanish Government's Plan for the Advancement of Language Technology (PlanTL), as part of its open R&D lines, has developed capitel, a linguistically annotated corpus of Spanish news articles, from which a small subset, syntactically annotated using UD v2, has been manually revised for the purpose of the capitel-ud competition [2]. ("Small subset" should be read here as relative to the size of the whole capitel corpus, which will be several orders of magnitude larger, and not to the mean size of other corpora parsed for UD.) This paper describes our participation in this competition, in which 12 participants were involved; unfortunately, only two of them finally submitted results. One of our prediction combinations ranked first in the official scoring table with a LAS score of 88.660, while the second system obtained 88.600 for the same metric.

Section 2 describes our experiments with different data-driven UD parsing implementations as well as the voting strategy adopted. The results of our submitted systems and system combinations are presented in Section 3, together with a preliminary discussion of them and of other issues regarding UD.

2. Approach
Rather than developing yet another UD parsing system, we have opted for the (re)use of well-established UD parsers to estimate UD parsing models from the capitel-ud datasets. There were a number of reasons for this decision, ranging from the number of existing systems capable of performing UD analysis to our desire to experiment with the combination of various prediction systems, as explained below.
The final decision was to apply at least three different toolkits to the UD parsing problem, not necessarily the best scoring ones in the shared tasks on this topic held over the last 2-3 years. The selection of systems was also based on ease of setup and configuration, as well as on the maturity (and clarity) of each candidate system's online documentation. In the end, the three solutions tested are UDPipe, NLP-Cube and Stanza. Our experiments with them are described in the rest of this section.

2.1. UDPipe
UDPipe [3] is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. The program version used in this experiment, 1.2, ranked 8th in the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (http://universaldependencies.org/conll17/results.html). Only the dependency parsing module has been trained, since the capitel-ud datasets provide tokenization, tagging and lemmatization for all data distributed to challenge participants. Datasets are provided in CoNLL-U format.

Although its main author has developed a UDPipe 2.0 prototype for the CoNLL 2018 UD Shared Task [4] (known as UDPipe-Future, available at https://github.com/CoNLL-UD-2018/UDPipe-Future), we wanted to try the original UDPipe implementation and consider it our baseline system for the current challenge. UDPipe 1.2 uses word embeddings, but its implementation predates the (now widespread) contextualized word embeddings available in, for instance, UDPipe 2.0.

In our experimentation with UDPipe, we have used the train dataset with random search for hyperparameters (using the run=number option). In addition, with all three packages used, ten-fold cross-validation on the training material has been performed, and this has been used to select the best-performing model. The only remarkable tuning in our experiments with respect to the original setup has to do with form and lemma embeddings. The UDPipe authors pre-compute form embeddings with word2vec using the training data; all other embeddings used by the system are initialized randomly and updated during training. We have instead also used form and lemma embeddings computed from a 500M-word corpus of newspaper text that we have compiled and POS tagged. Besides, we have used fastText word embeddings trained on the Common Crawl and Wikipedia corpora (available at https://fasttext.cc/docs/en/crawl-vectors.html).

Table 1 shows results on the development set for models built from the training material with different word embeddings (the remaining parameters are those suggested by the UDPipe developers; we refer to the validation dataset as dev).

Table 1
Results on the dev dataset using UDPipe with different embeddings and iterations

Iterations  Embeddings   UAS    LAS
10          train        87.03  83.23
10          train+dev    86.96  83.23
10          News Corpus  87.78  84.05
10          CC           87.55  83.86
20          train        87.61  83.77
20          train+dev    87.68  84.00
20          News Corpus  88.04  84.32
20          CC           87.57  83.90

Increasing the number of iterations on the training set above 20 (the value recommended by the UDPipe authors) does not improve any of the models. Our LAS results are on a par with those obtained by the UDPipe developers on the UD 2.3 AnCora corpus [5], whose LAS score is 84.6 for raw text, but using form and lemma embeddings derived from AnCora itself. It seems that using word embeddings obtained from sufficiently large volumes of same-genre text improves results with this toolkit.
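For concreteness, the following is a minimal sketch of the kind of parser-only training invocation described above, wrapped in Python. It assumes the udpipe 1.2 binary is on the PATH; the file names are placeholders, and the option names (iterations, embedding_form_file, embedding_lemma_file, run) follow our reading of the UDPipe 1.2 manual rather than the exact configuration of our runs.

    # A minimal sketch (not our exact configuration) of a parser-only UDPipe 1.2
    # training call: tokenizer and tagger are disabled because capitel-ud already
    # provides gold tokenization, tags and lemmas, and pre-computed form/lemma
    # vectors are passed to the parser. File names are placeholders.
    import subprocess

    parser_opts = ";".join([
        "iterations=20",                        # values above 20 did not help (see Table 1)
        "embedding_form_file=forms_news.vec",   # form vectors from the 500M-word news corpus
        "embedding_lemma_file=lemmas_news.vec", # lemma vectors from the same corpus
        "run=1",                                # one configuration of the random hyperparameter search
    ])

    subprocess.run([
        "udpipe", "--train", "capitel_udpipe.model",
        "--tokenizer=none", "--tagger=none",
        f"--parser={parser_opts}",
        "--heldout=capitel_dev.conllu",
        "capitel_train.conllu",
    ], check=True)

Cross-validation and the choice among randomized configurations then amount to repeating this call over folds and run values and keeping the model that scores best on held-out data.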
2.2. NLP-Cube
NLP-Cube is an open-source framework developed by Adobe that competed in CoNLL 2018, where it ranked 9th for LAS score (http://universaldependencies.org/conll18/results.html). The toolkit provides an end-to-end text processing solution using neural networks, and it includes modules for sentence splitting, tokenization, lemmatization, part-of-speech tagging, dependency parsing and named entity recognition for more than 50 languages [6].

For this software package, we have used off-the-shelf hyperparameters with Common Crawl embeddings. Early stopping was set to 40 (the default is 20) to allow for maximal optimization of the model. This time, performance is lower with the News Corpus embeddings, as can be seen in Table 2.

Table 2
Results for NLP-Cube with different embeddings

Embeddings   UAS    LAS
CC           90.69  86.84
News Corpus  90.28  86.16

NLP-Cube uses "multiple stacked bidirectional LSTM for the parser layers and project 4 specialized representations for each word in a sentence, which are later aggregated in a multilayer perceptron in order to produce arc and label probabilities" [6].

Although not tested for this competition (in which POS tagging is already provided by the organizers), NLP-Cube is an interesting model to explore, since it allows joint training to also output morphological features. According to the authors, this combined training increases the absolute UAS and LAS scores by up to 1.5%.
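As an illustration only, and not the training setup used in our experiments, the snippet below shows how an NLP-Cube model can be applied to Spanish text through the cube.api interface described in the project documentation; the "es" model code and the token attribute names are assumptions that may differ across NLP-Cube versions.

    # Minimal usage sketch for NLP-Cube (assumed cube.api interface; not our training setup).
    from cube.api import Cube

    cube = Cube(verbose=True)
    cube.load("es")                                    # load a (pre)trained Spanish model
    sentences = cube("El juez archivó ayer la causa.")
    for sentence in sentences:
        for i, entry in enumerate(sentence, start=1):
            # each entry roughly mirrors one CoNLL-U token line
            print(i, entry.word, entry.upos, entry.head, entry.label)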
2.3. Stanza
Stanza [7] is a Python natural language analysis toolkit developed by Stanford University. Its modules are built on top of the PyTorch library, and it provides pre-trained models for 66 languages. It can interface with the popular Stanford CoreNLP Java package from the same institution. For UD, Stanza implements a Bi-LSTM-based deep biaffine neural dependency parser, augmented with two linguistically motivated features that handle the linearization order of two words in a given language and the prediction of the typical linear distance between them.

The results obtained with this parser outperform those of UDPipe by nearly 5 points, as shown in Table 3. For this reason, we also considered the possibility of building another model with this parser, just in case the UDPipe results degraded the overall performance of the joint system.

For this purpose, we have tried to augment the training data by pooling together the training partitions of both capitel-ud and AnCora (this is actually an area of improvement for Stanza suggested by its authors, as a way to obtain language models that are more robust across genres [7]). We were aware of the different treatment of various Spanish language phenomena in capitel-ud and AnCora. These differences affect tokenization, morphological analysis and dependency relations. A by no means exhaustive list is presented in the following paragraphs.

Tokenization: The words al and del (preposition + masculine article) are split in capitel-ud, whereas they remain a single token in AnCora, the former decision being compliant with the UD v2 guidelines. Other multiword elements, such as adverbial, prepositional and conjunctive phrases, are represented as one token in capitel-ud but are split in AnCora (although recognized as a unit in the MISC column of the CoNLL-U format), the AnCora practice being UD v2 compliant. Perfect verbal forms, which are syntagmatic in Spanish, are considered one token in capitel-ud but two in AnCora, again with the AnCora decision compliant with the UD v2 guidelines.

Morphological analysis: Adjectives that do not mark gender morphologically (leve, either masculine or feminine, as opposed to recto, only masculine) do not include this feature in the parse tree in AnCora, whereas it is present in the capitel-ud analyses; both annotation practices are allowed by the UD v2 guidelines in this case.

Dependency relations: Relative pronouns, such as the very frequent word que, are systematically labeled with the corresponding syntactic function in AnCora, while capitel-ud uses a kind of double labeling in which both the function and the mark label are used; although described in the accompanying capitel-ud documentation, these features are not prescribed by the UD v2 guidelines. Relative clauses are labeled acl in AnCora, while capitel-ud uses the language-specific acl:relcl; both annotations are UD v2 compliant. In complex predicates such as llevar a cabo, the noun is labeled compound in AnCora (in accordance with the UD guidelines) but obl in capitel-ud.

From this incomplete comparison of annotation styles, let us note in passing that neither corpus perfectly follows the UD guidelines, although AnCora is closer to full compliance than capitel-ud.

In spite of these differences, especially those involving two different styles of coding dependency relations, data augmentation was considered in order to train a Stanza model on the training material of both the capitel-ud and AnCora corpora, giving three times more training data than that provided by the capitel-ud organizers. Results for both models are presented in Table 3.

Table 3
Results for Stanza with different training sets

Dataset                                UAS    LAS
capitel-ud training dataset            91.54  88.19
capitel-ud + AnCora training datasets  91.53  88.30

As can be observed, the pooled model using data augmentation increases LAS by 0.11 points with respect to the capitel-ud training set alone. Data efficiency is low, since 14,305 new training sentences, the whole AnCora training partition, have been used for this gain. However, this is probably due to the sharp differences in the annotation styles of the two datasets. A post-competition experiment using data augmentation with only 500 new sentences of the same genre (analyzed with the Stanza model and hand-corrected by the author) delivers a LAS score of 88.46. Note that this is also the best score obtained with our combined system (see Section 2.4).

With this improvement in mind, had we had more time for the competition, we could have implemented an automatic (partial) solution to bring both corpora closer. Unfortunately, in order to outperform other possible competitors, this would have had to be done by adopting the capitel-ud developers' decisions, which, as already stated, deviate from the UD guidelines.
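The pooling step itself is straightforward: the two training partitions are concatenated at the CoNLL-U level, with no attempt to reconcile the annotation differences listed above. A minimal sketch, with placeholder file names, is given below; the resulting file simply replaces the capitel-ud training file when retraining Stanza.

    # A minimal sketch of the data augmentation step: concatenate the capitel-ud
    # and AnCora training partitions into one CoNLL-U file (file names are placeholders).
    paths = ["capitel_train.conllu", "es_ancora-ud-train.conllu"]

    with open("pooled_train.conllu", "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as f:
                # CoNLL-U sentences are separated by a blank line; keep that
                # convention at the boundary between the two corpora
                out.write(f.read().rstrip("\n") + "\n\n")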
2.4. Voting strategy
Our underlying plan for this competition was, from the start, to combine the predictions of various UD parsers in an attempt to further improve our overall performance. Rather than implementing our own solution, we have used for this purpose conllu-voting.py, a Python implementation of the Chu-Liu-Edmonds minimum spanning tree algorithm over a graph built from CoNLL-U files, which is part of a set of "[s]cripts for compatibilitising between VISL-CG3, Apertium, CoNLL-X and Universal Dependencies" (https://github.com/ftyers/ud-scripts).

There were four system predictions to combine: those of UDPipe, NLP-Cube and the two models produced with Stanza. Table 4 shows results on the development dataset for several of the possible combinations (we use Stanza-C for the system trained on capitel-ud and Stanza-CAn for the system trained on the capitel-ud and AnCora datasets).

Table 4
Results for system combinations

Systems                                  UAS    LAS
UDPipe, NLP-Cube, Stanza-C, Stanza-CAn   91.65  88.39
UDPipe, Stanza-C, Stanza-CAn             91.63  88.37
UDPipe, NLP-Cube, Stanza-C               91.38  88.02
UDPipe, NLP-Cube, Stanza-CAn             91.41  88.16
UDPipe, Stanza-C, Stanza-CAn             91.63  88.29
NLP-Cube, Stanza-C, Stanza-CAn           91.72  88.46

As Table 4 shows, although the combination of all parsing systems improves our overall performance by 0.09 LAS points, it is by leaving out our lowest-performing system, UDPipe, that we get a LAS increase of 0.16 points. This is, then, the combination used in one of our runs.
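To make the voting scheme concrete, the sketch below reimplements the idea rather than the conllu-voting.py script itself: each system contributes one vote per (head, dependent) arc, votes become edge weights in a directed graph over token positions (with 0 as the artificial root), a Chu-Liu-Edmonds style maximum spanning arborescence, here computed with networkx, selects the final tree, and the majority label is attached to each surviving arc. The conllu and networkx packages and the file names are assumptions of this sketch.

    # A minimal voting sketch over parallel CoNLL-U predictions (not conllu-voting.py itself).
    from collections import Counter
    import conllu
    import networkx as nx

    def vote(parallel_sentences):
        """Combine one sentence's predictions from several systems into {dep_id: (head_id, deprel)}."""
        arc_votes = Counter()    # (head, dep) -> number of systems proposing that arc
        label_votes = {}         # (head, dep) -> Counter over proposed deprels
        for sent in parallel_sentences:
            for tok in sent:
                if not isinstance(tok["id"], int):   # skip multiword-token ranges such as "3-4"
                    continue
                arc = (tok["head"], tok["id"])
                arc_votes[arc] += 1
                label_votes.setdefault(arc, Counter())[tok["deprel"]] += 1

        # Edge weights are vote counts; Chu-Liu-Edmonds picks the highest-scoring tree.
        graph = nx.DiGraph()
        for (head, dep), weight in arc_votes.items():
            graph.add_edge(head, dep, weight=weight)
        tree = nx.maximum_spanning_arborescence(graph, attr="weight")

        return {dep: (head, label_votes[(head, dep)].most_common(1)[0][0])
                for head, dep in tree.edges()}

    # Parallel predictions (same sentences, same tokenization) from the four systems.
    files = ["udpipe.conllu", "nlpcube.conllu", "stanza_c.conllu", "stanza_can.conllu"]
    systems = [conllu.parse(open(path, encoding="utf-8").read()) for path in files]

    for parallel in zip(*systems):
        print(vote(parallel))

In this sketch all systems are weighted equally, and dropping UDPipe from the input list corresponds exactly to the three-system combination that gave our best development score.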
3. Results and discussion
The organizers limited the number of runs each participating team could submit to three. Our submissions were therefore chosen according to the performance obtained on the development material. Hence, runs from both models built with Stanza (renamed C and CA in Table 5) were submitted, as was the best of our combinations (CACV). The scores in Table 5 refer to the test dataset and are thus our final results in the competition.

Table 5
Results for the system runs submitted, on the test dataset

System  UAS    LAS
C       91.72  88.47
CA      91.77  88.53
CACV    91.94  88.66

It is important to note that none of the models is optimal, since an extremely low early stopping condition of 5 was applied to every model built. Nonetheless, the results seem to support the maxim that more data is better data, even when very heterogeneous training sources are used. This can be observed in the 0.06-point LAS improvement of the system using both capitel-ud and AnCora training material, a small improvement due to low data efficiency that can be attributed to the differing (dependency) annotation styles of the corpora used. If a smaller but tighter dataset is used for data augmentation, the above-mentioned maxim seems to be more clearly supported (a 0.16-point LAS improvement with only 500 new sentences). Most importantly, however, the use of several parsing models built with different software solutions and adequately combined results in a slight but promising performance boost. This is clearly an avenue that needs further exploration.

Acknowledgments
We would like to thank two anonymous reviewers for their careful reading of our contribution and their many insightful comments and suggestions.

This work was first presented under the name Martín Lendínez. This is the pseudonym used by the Spanish writer Mariano Antolín Rato in his translations into Spanish of the North American Beat Generation writers. The name started as a joke with his friends Juan Cueto and Gonzalo Suárez, who had signed a number of works under the name Claudio Lendínez junior. Martín became a kind of hippie brother for Claudio, a brother with a long and wide professional career. We had used his name in this shared task as a tribute to the wonderful work of Antolín Rato in those times of change in Spain and, more modestly, to enlarge Martín's CV. Unfortunately, due to reasons beyond the control of the author, it has been impossible to keep this author name throughout the publishing process.

References
[1] R. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. Hall, S. Petrov, H. Zhang, O. Täckström, C. Bedini, N. Bertomeu Castelló, J. Lee, Universal Dependency annotation for multilingual parsing, in: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Sofia, Bulgaria, 2013, pp. 92–97. URL: https://www.aclweb.org/anthology/P13-2017.
[2] J. Porta-Zamorano, L. Espinosa-Anke, Overview of CAPITEL Shared Tasks at IberLEF 2020, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020), 2020.
[3] M. Straka, J. Straková, Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe, in: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 88–99. URL: https://www.aclweb.org/anthology/K17-3009. doi:10.18653/v1/K17-3009.
[4] M. Straka, UDPipe 2.0 prototype at CoNLL 2018 UD shared task, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 197–207. URL: https://www.aclweb.org/anthology/K18-2020. doi:10.18653/v1/K18-2020.
[5] M. Taulé, M. A. Martí, M. Recasens, AnCora: Multilevel annotated corpora for Catalan and Spanish, in: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), European Language Resources Association (ELRA), Marrakech, Morocco, 2008. URL: http://www.lrec-conf.org/proceedings/lrec2008/pdf/35_paper.pdf.
[6] T. Boros, S. D. Dumitrescu, R. Burtica, NLP-Cube: End-to-end raw text processing with neural networks, in: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 171–179. URL: https://www.aclweb.org/anthology/K18-2017. doi:10.18653/v1/K18-2017.
[7] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 101–108. URL: https://www.aclweb.org/anthology/2020.acl-demos.14. doi:10.18653/v1/2020.acl-demos.14.