More Data and New Tools. Advances in Parsing the
Index Thomisticus Treebank
Federica Gamba1 , Marco Passarotti2 and Paolo Ruffolo2
1 Istituto Universitario di Studi Superiori (IUSS), Palazzo del Broletto, Piazza della Vittoria 15, 27100 Pavia, Italy
2 CIRCSE Research Centre, Università Cattolica del Sacro Cuore, Largo A. Gemelli 1, 20123 Milan, Italy


Abstract
This paper investigates recent advances in parsing the Index Thomisticus Treebank, which encompasses Medieval Latin texts by Thomas Aquinas. The research focuses on two variables. On the one hand, it examines the impact that a larger dataset has on parsing results; on the other hand, the performance of new parsers is compared with that of less recent tools. The benchmark against which parsing advances are measured is the set of results on the Index Thomisticus Treebank reported in a previous work. First, the best performing parser among those concerned in that study is tested on a larger dataset than the one originally used. Then, some parser combinations developed in the same study are evaluated as well, showing that more training data yield more accurate performance. Finally, to examine the impact that newly available tools have on parsing results, we train, test, and evaluate two neural parsers chosen from among those that performed best in the CoNLL 2018 Shared Task. Our experiments reach the highest accuracy rates achieved so far in automatic syntactic parsing of the Index Thomisticus Treebank and of Latin overall.

Keywords
dependency parsing, Latin




1. Introduction
Built upon the Index Thomisticus corpus, which collects the opera omnia of Thomas Aquinas
[8], the Index Thomisticus Treebank (IT-TB) [20] is a prime linguistic resource among
those currently available for Latin. Developed at the CIRCSE Research Centre in Milan, the
IT-TB is the largest of the five Latin treebanks currently available (see Section 2). However,
despite the good availability of syntactically annotated corpora for Latin, a number of
difficulties emerge when it comes to parsing the language. First of all, the richly inflected
nature of Latin results in a rather high rate of non-projectivity in dependency trees, which
arises from long-distance dependencies in languages with flexible word order and tends to
affect negatively the accuracy rates of automatic parsing. Nevertheless, the Medieval Latin
of Thomas Aquinas' texts appears to be less non-projective than Classical Latin, as outlined
by Passarotti and Ruffolo [22], who report a rate of 3.24% non-projective dependencies
in the IT-TB, against 6.65% in the Latin Dependency Treebank, which includes Classical
Latin texts. Moreover, the wide diachronic, diatopic and diaphasic variability of Latin affects

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The
Netherlands
federica.gamba@iusspavia.it (F. Gamba); marco.passarotti@unicatt.it (M. Passarotti);
paolo.ruffolo@posteo.net (P. Ruffolo)
ORCID: 0000-0002-9806-7187 (M. Passarotti); 0000-0002-9120-0846 (P. Ruffolo)
                               © 2021 Copyright for this paper by its authors.
                               Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                               CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org




the overall accuracy of different parsers, as they tend to perform better on texts that
resemble the specific textual variety they were trained on (cf. [22]).
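   Concretely, an arc is non-projective if some token lying between its head and its dependent is not dominated by that head. The following is a minimal sketch of how such a rate can be measured, assuming a simple head-list representation of each sentence (an illustrative assumption, not code from the works cited above):

```python
# A minimal sketch for measuring the rate of non-projective arcs in a
# treebank; the head-list representation is an illustrative assumption.
def is_nonprojective(heads, dep):
    """heads[i] is the head of token i (1-based; 0 denotes the root).
    The arc into dep is non-projective if some token strictly between
    dep and its head does not have that head among its ancestors."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    for tok in range(lo + 1, hi):
        node = tok
        while node not in (0, head):   # climb towards the root
            node = heads[node]
        if node != head:               # reached the root, not the head
            return True
    return False

def nonprojectivity_rate(sentences):
    arcs = [(heads, d) for heads in sentences
            for d in range(1, len(heads))]
    return 100 * sum(is_nonprojective(h, d) for h, d in arcs) / len(arcs)

# A crossing arc: token 2 attaches to token 4 across token 3, whose
# subtree does not contain token 4.
print(nonprojectivity_rate([[0, 0, 4, 1, 3]]))  # -> 25.0
```

Counts of this kind underlie the 3.24% vs 6.65% figures cited above.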
   This paper describes a study aimed at improving the performance of automatic dependency
parsing for the IT-TB in its native annotation scheme, taking as a benchmark the research
described by Ponti and Passarotti [23], who, after testing different parsers, identified DeSR
[1] as the best performing one. After building a new feature model for DeSR specifically suited
to the IT-TB, Ponti and Passarotti [23] applied a post-processing combination technique and
showed that combining parsers based on different types of algorithms returned better parsing
results than plain DeSR.
   Recent years have seen many steps forward in both the size and the type of the linguistic
resources available for Latin, as well as in the performance of probabilistic tools for natural
language processing (NLP). As for the IT-TB, the treebank has grown remarkably since the
study of Ponti and Passarotti [23], thus making it possible to evaluate the impact that a
larger training set has on the performance of probabilistic tools in parsing the IT-TB. As
for the NLP tools, new techniques and tools have recently been developed that exploit the
ever-growing amount of available training data, thus making it possible to verify whether the
most recent tools turn out to be more accurate than less recent ones. This paper presents
the results obtained by investigating the impact that these two variables, namely a larger set
of training data and new NLP tools, have on parsing the IT-TB.
   The paper is organised as follows. Section 2 presents an overview of relevant related studies.
In Section 3 the data are presented. Section 4 focuses on the re-evaluation of DeSR perfor-
mances on the new (larger) training set of the IT-TB. Section 5 explores the impact of using
more training data on the accuracy rates of two combinations of parsers. Section 6 reports
the performances of two neural parsers (namely, TurkuNLP and ICS-PAS). In Section 7, we
present the results provided by three combinations of DeSR with the two neural parsers. In
Section 8, an in-depth evaluation of the results is performed and discussed. Finally, Section 9
concludes the paper.


2. Related Work
As mentioned, five treebanks are currently available for Latin. Besides the IT-TB, the other
Latin treebanks (all dependency-based) are the following: the PROIEL treebank [15], the Latin
Dependency Treebank (part of the Ancient Greek and Latin Treebank) [2], the Late Latin
Charter Treebank [10] and the UDante treebank [9]. All the Latin treebanks are annotated
both according to their native scheme and to the Universal Dependencies one (UD) [19], except
for the UDante treebank, which is available only in the UD scheme.
  With respect to parsing the IT-TB, the above-mentioned study by Ponti and Passarotti [23],
which we take here as a benchmark, was preceded by other relevant works in the field. In 2010,
Passarotti and Ruffolo [22] trained and tested a number of probabilistic dependency parsers,
using data from both the IT-TB and the Latin Dependency Treebank (LDT). In the same
year, Passarotti and Dell'Orletta [21] employed DeSR to parse the IT-TB. They delineated an
ad-hoc configuration of DeSR features so as to adapt the parser to the specific processing of
Medieval Latin and improve its accuracy. They also defined a revision parsing method and
combined the outputs of different algorithms.
  However, the most recent study on parsing the IT-TB is the one carried out by Ponti and




Table 1
Size of the T2 training and test sets
                                                   Nodes     Sentences
                                        Training   402,554    24,187
                                        Test        44,752     2,644
                                        Total      447,306    26,831


Passarotti [23]. In particular, as far as DeSR is concerned, the best results are achieved when
the tool exploits a multilayer perceptron (MLP) algorithm, a reversed direction of the parsing
transition (right-to-left) and specifically tuned settings: the best reported Labeled Attachment
Score (LAS) is 83.14 and the highest Unlabeled Attachment Score (UAS) is 88.46 [7].
Regarding the best performing combination of parsers, referred to as C4 in [23], the best
results are 86.5 in LAS and 90.97 in UAS. The results obtained through combination already
represent an improvement with respect to [21], which was the state of the art in parsing
Medieval Latin before [23].
  Thanks to the availability of the UD treebanks for Latin, the CoNLL 2018 Shared Task on
Multilingual Parsing from Raw Text to Universal Dependencies [28] also provided results for
Latin. The tool that proved to perform best on Latin is HIT-SCIR [11], which ranked highest
among all the participants, both in terms of LAS and UAS, for all the Latin treebanks. In
particular, it obtained an 87.08 LAS and an 89.31 UAS on the IT-TB, a 73.61 LAS and a 77.62
UAS on the PROIEL treebank, and a 72.63 LAS and an 80.47 UAS on the Latin Dependency
Treebank.1


3. Data
The data used in the experiments consist of the latest release of the IT-TB in its native
annotation style [3], which resembles that of the analytical layer of the Prague Dependency
Treebank for Czech.2 This version of the treebank features the entire Summa contra Gentiles
(four books) and some excerpts from Summa theologiae and Scriptum super Sententiis Petri
Lombardi selected as part of the concordances of the lemma forma 'form'. This release of the
IT-TB makes available more data than the versions used as data sources for previous
experiments in dependency parsing for Latin. In particular, with respect to [23] the missing
part of the third book and the entire fourth book of Summa contra Gentiles are now included
in the dataset, corresponding to 11,881 additional sentences and 193,422 additional tokens.
For practical reasons, we refer to the enlarged dataset as T2 and to the dataset used in [23]
as T1.
   For evaluation purposes, the treebank is split into a training set and a test set with a ratio
of about 9:1. Table 1 illustrates the size of the T2 training and test sets resulting from this
partition. When required for the training phase, a development set of the same size as the
test set is excerpted from the training data.
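   A minimal sketch of such a sentence-level 9:1 split follows, assuming the treebank is stored as a single CoNLL-style file; the file names and the choice of the final tenth as test set are illustrative assumptions, not necessarily the partition used here:

```python
# A minimal sketch of a 9:1 sentence-level split of a CoNLL-style
# treebank; file names and the contiguous split are assumptions.
def read_sentences(path):
    with open(path) as f:
        sent = []
        for line in f:
            if line.strip():
                sent.append(line)
            elif sent:            # blank line ends a sentence
                yield sent
                sent = []
        if sent:
            yield sent

sentences = list(read_sentences("it-tb.conll"))
cut = int(len(sentences) * 0.9)
for name, part in (("train.conll", sentences[:cut]),
                   ("test.conll", sentences[cut:])):
    with open(name, "w") as out:
        for sent in part:
            out.writelines(sent)
            out.write("\n")       # restore the sentence separator
```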



   1
      The LLCT and the UDante treebanks were not used in the CoNLL 2018 Shared Task, as they were only
made available in the UD repository in releases v2.6 (May 15th 2020) and v2.8 (May 15th 2021) respectively.
    2
      https://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html.




Table 2
Results on T1 and T2 by different settings of DeSR
                                               LAS              UAS
                                          T1         T2    T1         T2
                              MLP, l    82.01    81.29    87.35   86.60
                              MLP, r    83.14    83.87    88.46   88.72
                              SVM, r    83.35    83.92    87.25   88.41


4. DeSR Evaluation
After subsetting the dataset in training set and test set, we replicate the first part of the
experiments performed by Ponti and Passarotti [23]. We evaluate the accuracy rates of the
dependency parser DeSR when trained on the new dataset, yet preserving the algorithms and
the feature model defined in [23]. In this way, the only variable to be evaluated is the extended
size of the training and test sets, as all the others remain the same, thus allowing to assess the
impact of a larger amount of training data on the accuracy rates of the parser.
   DeSR [1] is a shift-reduce parser which in its basic settings exploits an MLP algorithm and
performs a left-to-right transition while parsing. We make use of the same version of DeSR
used in [23] (v. 1.4.3).
   First, the performance of DeSR with its basic settings is evaluated. Secondly, we reverse the
direction of the transition from the standard left-to-right to right-to-left, keeping the same
MLP algorithm. Thirdly, the MLP algorithm is replaced by a support vector machine (SVM)
algorithm, while the transition direction is kept right-to-left, as in [23].
   To sum up, three different settings of DeSR are trained and tested:

   • MLP algorithm, left-to-right (MLP, l);

   • MLP algorithm, right-to-left (MLP, r);

   • SVM algorithm, right-to-left (SVM, r).

   Table 2 shows the accuracy rates of the different settings of DeSR in terms of LAS and UAS.
Results obtained by using the enlarged dataset (T2) are placed side by side with those reported
by [23] (T1).
   What emerges from Table 2 is that using a larger training set tends to improve the accuracy
rates obtained by the parser. However, this does not hold true for the left-to-right MLP
algorithm, which was also the worst performing setting in [23]. When it comes to the two
most efficient and accurate settings (namely, those with reversed direction of transition), the
enlarged dataset improves the accuracy rates of the parser. As far as the right-to-left MLP
algorithm is concerned, we observe a +0.73 improvement in terms of LAS (83.14 with T1 vs
83.87 with T2) and a +0.26 improvement in terms of UAS. As for the right-to-left SVM
algorithm, when run on T2 it outperforms the same algorithm run on T1 by +0.57 (LAS) and
+0.16 (UAS). However, when the deviation amounts to a few tenths of a percentage point (as
in the UAS rates just mentioned), the improvement cannot be considered meaningful.
   The results thus obtained will constitute a baseline for the experiments run and presented
later on.
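   For reference, LAS and UAS can be computed directly from aligned gold and predicted files. Below is a minimal sketch assuming the CoNLL-X column layout (HEAD in the seventh column, DEPREL in the eighth); it is not the evaluation script actually used in these experiments:

```python
# A minimal sketch of LAS/UAS computation over aligned gold and
# predicted CoNLL-style files; column indices are assumptions based
# on the CoNLL-X format (HEAD = column 7, DEPREL = column 8).
def attachment_scores(gold_path, pred_path):
    las = uas = total = 0
    with open(gold_path) as gold, open(pred_path) as pred:
        for gline, pline in zip(gold, pred):
            gcols, pcols = gline.split("\t"), pline.split("\t")
            if len(gcols) < 8:            # sentence separator line
                continue
            total += 1
            if gcols[6] == pcols[6]:      # correct head: UAS hit
                uas += 1
                if gcols[7] == pcols[7]:  # correct label too: LAS hit
                    las += 1
    return 100 * las / total, 100 * uas / total
```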




Table 3
Results on T1 and T2 by different combinations of parsers
                                             LAS                 UAS
                                        T1         T2       T1         T2
                                  C3   86.37   86.86     90.91     91.14
                                  C4   86.50   87.37     90.97     91.56


5. DeSR Combination
The next step in replicating the experiments of [23] concerns the combination of different
parsers.
  After examining the outputs of the single parsers, Ponti and Passarotti [23] employed a post-
processing technique that combines the outputs of different (types of) parsers. This technique
exploits the benefits of combination, following the assumption that the mutual difference be-
tween parsers promises to improve the final accuracy rates. For the purposes of combination,
an algorithm based on unweighted voting was used [27]; a sketch of the idea is given after the
combination definitions below. In our work, we replicate the experiments run on the combi-
nations named respectively C3 and C4 in [23], chosen as those that reach the highest accuracy
rates among the ones concerned. Both C3 and C4 combine outputs produced by different
settings of DeSR with outputs from other types of parsers, namely:

   • MTGB: a graph-based parser from the MATE-tools collection [4], in its latest version
     (anna-3.61);
   • Joint: a shift-reduce parser, part of the MATE-tools collection and developed by Bohnet
     et al. [5].

The structure of the combinations C3 and C4 is the following:
   • C3: DeSR (MLP, r) + DeSR (MLP, l) + Joint + MTGB;
   • C4: DeSR (MLP, r) + DeSR (SVM, r) + DeSR (MLP, l) + Joint + MTGB.
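   The voting scheme is simple: for each token, the (head, label) pair proposed by the majority of the parsers is selected. The following is a minimal sketch, assuming the parser outputs have already been aligned token by token; tie-breaking and the tree-constraint handling discussed in [27] are deliberately left out:

```python
# A minimal sketch of unweighted voting over aligned parser outputs.
from collections import Counter

def vote(predictions):
    """predictions: one list of (head, deprel) pairs per parser, all
    aligned token by token; returns the majority pair per token."""
    combined = []
    for token_votes in zip(*predictions):
        (head, deprel), _ = Counter(token_votes).most_common(1)[0]
        combined.append((head, deprel))
    return combined

# Example: three parsers agree on token 1 and disagree on token 2.
outputs = [[(2, "Sb"), (0, "Pred")],
           [(2, "Sb"), (0, "Pred")],
           [(2, "Sb"), (1, "Obj")]]
print(vote(outputs))  # [(2, 'Sb'), (0, 'Pred')]
```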
   The results of applying the post-processing combination technique to both T1 and
T2 are shown in Table 3. For both C3 and C4, the larger dataset proves to lead to higher
accuracy rates. In particular, the C4 combination shows the largest gap between T1 and
T2, both for LAS (T1: 86.50, T2: 87.37, improvement: +0.87) and for UAS (T1: 90.97, T2:
91.56, improvement: +0.59). These results will serve as a baseline for the next combination
experiments.


6. New Tools
After examining the impact that a larger training/test set has on the accuracy rates of parsing
the IT-TB, we focus on the second variable taken into account in our work, namely training and
testing NLP tools of a different (and more recent) type than those used by Ponti and Passarotti
[23]. To determine which parsers to consider, we refer to the CoNLL 2018 Shared Task on
Multilingual Parsing from Raw Text to Universal Dependencies [28]. Among the systems that
took part in the Shared Task, we select the two (both neural) parsers that ranked highest with
respect to Latin, and in particular to the IT-TB:




  • TurkuNLP: an end-to-end neural parsing pipeline, developed by Kanerva et al. [16];

  • ICS-PAS: a semi-supervised neural system developed in Warsaw by Rybak and Wróblewska
    [24].

  The two neural parsers are run on the same dataset on which DeSR was trained and tested
in Sections 3 and 4, in order to evaluate the specific impact that neural parsing has on the
accuracy rates.

6.1. TurkuNLP
TurkuNLP [16] is a neural pipeline that performs four tasks: segmentation, morphological
tagging, parsing and lemmatisation.
   Lemmatisation is carried out with a novel approach that exploits the OpenNMT neural
machine translation toolkit [17]. As for parsing, the tool is based on the Stanford parser by
Dozat, Qi, and Manning [13], which ranked highest in the CoNLL 2017 Shared Task on
Multilingual Parsing from Raw Text to Universal Dependencies [29]. First, a word encoder
embeds each token by summing a set of learned token embeddings, pretrained token
embeddings, and token embeddings encoded from the sequence of its characters by a
unidirectional LSTM. Then, the token embeddings are combined with part-of-speech
embeddings as well. Afterwards, representations of tokens in context are created and used to
score relations and attachments in the dependency trees. See [12] for further details.
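   The attachment scoring at the core of [12] is biaffine: every token is projected once as a candidate head and once as a candidate dependent, and each head-dependent pair receives a score. The following PyTorch sketch illustrates the idea; the hidden size and all names are illustrative assumptions, not the TurkuNLP code:

```python
# A minimal sketch of biaffine arc scoring in the spirit of Dozat and
# Manning [12]; dimensions and names are assumptions.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, hidden_dim: int = 400):
        super().__init__()
        # Distinct projections for the "head" and "dependent" roles.
        self.head_mlp = nn.Linear(hidden_dim, hidden_dim)
        self.dep_mlp = nn.Linear(hidden_dim, hidden_dim)
        # Bilinear term plus a head-only bias term.
        self.U = nn.Parameter(torch.zeros(hidden_dim, hidden_dim))
        self.b = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, encoded):
        # encoded: (sentence_length, hidden_dim) contextual vectors.
        h = torch.relu(self.head_mlp(encoded))  # as candidate heads
        d = torch.relu(self.dep_mlp(encoded))   # as candidate dependents
        # scores[i, j]: plausibility of token j as the head of token i.
        return d @ self.U @ h.T + h @ self.b

# Each dependent then picks its highest-scoring head:
# heads = scorer(encoded).argmax(dim=1)
```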
   We begin by training a new model of TurkuNLP on the extended IT-TB dataset (T2). To
this end, we first employ the pre-trained word embeddings for Latin published by Facebook
and developed with the fastText tool [6]. We test both the embeddings trained on Wikipedia3
[6] and their newer version [14], trained on Wikipedia and Common Crawl.4 Afterwards,
we develop another model on T2 by using our own embeddings for the Index Thomisticus
(IT), which we create with fastText (default settings) [6] from the opera omnia of Thomas
Aquinas provided by the IT corpus [8]. We build two kinds of IT embeddings: (1) token-based
embeddings, trained on the corpus stored in a one-token-per-line format; (2) sentence-based
embeddings, trained on the corpus stored in a one-sentence-per-line format.
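   A minimal sketch of how such embeddings can be built and exported with the fastText Python bindings (file names are illustrative assumptions; the training settings are the defaults, as in our experiments):

```python
# A minimal sketch of training the IT embeddings with fastText;
# file names are illustrative assumptions.
import fasttext

# Corpus of Thomas Aquinas' works, one sentence per line (the
# token-based variant would use one token per line instead).
model = fasttext.train_unsupervised("it_corpus_sentences.txt")

# Export the vectors in the textual .vec format read by the parsers.
words = model.get_words()
with open("it_sentence_embeddings.vec", "w") as out:
    out.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(str(x) for x in model.get_word_vector(w))
        out.write(f"{w} {vec}\n")
```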
   All the trained models are then evaluated on the test set described in Section 3. Table 4
shows the results (LAS, UAS) obtained by TurkuNLP in comparison to the best performing
settings of DeSR and to the best performing combination pipeline (C4).
   The models that exploit the Facebook embeddings and the IT sentence-based embeddings
prove to perform best, with the IT sentence-based embeddings (LAS: 82.7, UAS: 85.9) outper-
forming their token-based counterpart (LAS: 82.1, UAS: 85.5). However, as clearly emerges
from Table 4, TurkuNLP obtains significantly lower accuracy rates than both DeSR (MLP, r
and SVM, r) and C4, especially in terms of UAS.

6.2. ICS-PAS
ICS-PAS [24] is a neural system consisting of a jointly trained tagger, lemmatiser, and depen-
dency parser. The output dependency tree is predicted with a model trained on a cross-entropy
loss function. To avoid

   3
       https://github.com/facebookresearch/fastText/blob/master/docs/pretrained-vectors.md.
   4
       https://fasttext.cc/docs/en/crawl-vectors.html.




Table 4
Results on T2 by different trained models of TurkuNLP compared to DeSR and C4
                                                        LAS     UAS
                              Facebook embeddings       82.6    85.8
                              Newer embeddings          81.8    85.2
                              IT token embeddings       82.1    85.5
                              IT sentence embeddings    82.7    85.9
                              DeSR (MLP, r)             83.87   88.72
                              DeSR (SVM, r)             83.92   88.41
                              C4                        87.37   91.56


Table 5
Results on T2 by different trained models of ICS-PAS compared to DeSR and C4
                                                        LAS     UAS
                              Facebook embeddings       82.0    85.5
                              Newer embeddings          82.1    85.6
                              IT token embeddings       81.5    82.2
                              IT sentence embeddings    82.2    85.7
                              DeSR (MLP, r)             83.87   88.72
                              DeSR (SVM, r)             83.92   88.41
                              C4                        87.37   91.56


cycles in the predictions, an additional 'cycle-penalty' loss function is used. During both arc
prediction and label prediction, heads and dependents are represented as vectors. See [24] for
further details.
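   The need for such a penalty becomes clear once one notices that selecting a head independently for each token can produce arc sets that are not trees. A minimal sketch of detecting such a cycle follows; it is purely illustrative and is not the ICS-PAS loss itself:

```python
# A minimal sketch showing how independently predicted heads can
# violate treeness; purely illustrative.
def find_cycle_token(heads):
    """heads[i] is the predicted head of token i (1-based; index 0 is
    a placeholder and the value 0 denotes the root). Returns a token
    lying on a cycle, or None if every token reaches the root."""
    for start in range(1, len(heads)):
        visited = set()
        node = start
        while node != 0 and node not in visited:
            visited.add(node)
            node = heads[node]
        if node != 0:          # walk looped before reaching the root
            return node
    return None

# Token 1 heads 2, 2 heads 3, 3 heads 1: a cycle; token 4 is fine.
print(find_cycle_token([0, 2, 3, 1, 0]))  # -> 1
```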
   We evaluate ICS-PAS in the same manner as TurkuNLP. We thus begin by training a model
on the extended IT-TB dataset (T2), employing the pre-trained fastText word embeddings for
Latin [6] and their newer version [14]. Two further models are then trained on T2, by using
respectively the token-based and sentence-based IT embeddings described in Section 6.1.
   Table 5 shows the results (LAS, UAS) obtained by testing ICS-PAS with our trained models.
The ICS-PAS accuracy rates are displayed together with those of the best performing settings
of DeSR and of the best performing combination pipeline.
   As illustrated in Table 5, ICS-PAS and TurkuNLP obtain extremely similar accuracy rates,
with TurkuNLP slightly outperforming ICS-PAS by a few tenths of a point. The best settings
of DeSR and the C4 combination still provide better parsing performance.


7. A New Combination
As mentioned in Section 5, the mutual difference between parsers can represent a concrete way
to improve their performance. The two parsers we selected from the CoNLL 2018 Shared Task
[28] differ substantially from DeSR, particularly in their choice to employ embeddings and
implement neural systems. Such a sizeable difference between the parsers raises the question
of what performance they could reach if combined. To answer this question,
we evaluate three different combinations of DeSR together with TurkuNLP and ICS-PAS, by




Table 6
Results on T2 by new combinations of parsers
                                                    LAS      UAS
                                         CombA     87.65    91.66
                                         CombB     87.72    91.72
                                         CombC     89.44    92.85
                                         C4        87.37     91.56


applying the same algorithm for unweighted voting used in the experiment described in Section
5:

   • CombA: DeSR (MLP, r) + DeSR (SVM, r) + DeSR (MLP, l) + ICS-PAS;

   • CombB: DeSR (MLP, r) + DeSR (SVM, r) + DeSR (MLP, l) + TurkuNLP;

   • CombC: DeSR (MLP, r) + DeSR (SVM, r) + DeSR (MLP, l) + ICS-PAS + TurkuNLP.

  As for TurkuNLP and ICS-PAS, we include in the combinations the outputs obtained with
the IT sentence-based embeddings, as they proved to achieve better performances than the
token-based ones.
  Table 6 reports the accuracy rates, in terms of LAS and UAS, provided by the three combi-
nations. Results obtained by C4, the best performing combination among the ones proposed
in [23], are displayed as well for comparison purposes.
  The results in Table 6 show how the combinations that include DeSR and, respectively,
ICS-PAS (CombA) and TurkuNLP (CombB) outperform the two tools alone (see Subsections
6.1 and 6.2). In particular, CombA reaches a LAS of 87.65 and a UAS of 91.66, while plain
ICS-PAS exploiting the IT sentence-based embeddings obtains a LAS of 82.2 and a UAS of
85.7. As for CombB, it obtains a LAS of 87.72 and a UAS of 91.72, while plain TurkuNLP,
using the IT sentence-based embeddings, reaches a LAS of 82.7 and a UAS of 85.9. The gap
is remarkable, being around +5 points in terms of LAS and around +6 in terms of UAS.
Yet a further, even more remarkable improvement is provided by CombC, which combines the
three different DeSR settings with both ICS-PAS and TurkuNLP. While CombA and CombB
achieve accuracy rates similar to, although slightly higher than, those of C4, CombC outperforms
CombA and CombB by almost 2 points in LAS (89.44) and by more than 1 point in UAS (92.85).
These results also outperform by more than 2 points those provided by the HIT-SCIR parser
in the 2018 CoNLL Shared Task, reported in Section 2 (LAS: 87.08, UAS: 89.31).


8. In-depth Evaluation
Given the remarkable improvement in parsing quality obtained by combining different
parsers, we examine the specific contribution that each parser provides to the combination.
We present here an in-depth evaluation of the results achieved on the T2 test set, focusing on
a number of relevant dependency relations and examining the parser-specific performance on them.5
   5
     The in-depth evaluation is performed by using the MaltEval evaluation tool for dependency parsers [18],
available at http://www.maltparser.org/malteval.html.
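   MaltEval reports exactly this breakdown; the same per-relation LAS can also be obtained by grouping the attachment-score counts of the sketch in Section 4 by gold relation (column layout again assumed to be CoNLL-X):

```python
# A minimal sketch of per-relation LAS, extending the attachment-score
# sketch from Section 4.
from collections import Counter

def per_deprel_las(gold_path, pred_path):
    correct, total = Counter(), Counter()
    with open(gold_path) as gold, open(pred_path) as pred:
        for gline, pline in zip(gold, pred):
            gcols, pcols = gline.split("\t"), pline.split("\t")
            if len(gcols) < 8:        # sentence separator line
                continue
            rel = gcols[7]            # gold DEPREL
            total[rel] += 1
            if gcols[6] == pcols[6] and gcols[7] == pcols[7]:
                correct[rel] += 1
    return {rel: 100 * correct[rel] / total[rel] for rel in total}
```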




Table 7
LAS evaluation of CombC and its members for a selected set of dependency relations
    Deprel      CombC     DeSR (MLP,l)     DeSR (MLP,r)     DeSR (SVM,r)     ICS-PAS   TurkuNLP
    Adv           90.0         82.1             85.0             85.2          89.2      89.8
    Atr           92.0         85.1             86.7             87.8          91.1      91.0
    Atr_Co        65.0         50.5             52.1             54.4          68.9      70.9
    Atv           48.0         29.3             39.7             38.9          52.9      48.5
    AtvV          22.2         11.8             12.5             0.00          18.2      15.4
    AuxC          90.5         78.5             84.6             82.8          86.1      87.7
    AuxP          90.1         81.5             85.6             86.5          88.3      88.8
    AuxZ          87.2         80.6             81.5             83.2          85.9      86.7
    Coord         85.0         75.0             77.6             73.4          82.5      82.8
    Obj           89.7         80.2             83.0             84.0          88.1      88.8
    Obj_Co        67.9         47.0             48.7             53.2          68.1      71.9
    Pnom          87.6         78.3             81.4             81.3          86.2      86.6
    Pred          98.9         96.4             97.2             96.6          95.1      95.3
    Pred_Co       93.2         85.6             85.4             85.4          89.9      88.5
    Sb            90.0         81.5             83.7             83.7          89.1      89.9
    Sb_Co         71.4         53.2             57.6             58.7          73.6      75.1


   As highlighted by the results reported in Table 7, TurkuNLP turns out to perform best on
most of the selected relations. Only in a few cases (attributes: Atr; verbal attributes: Atv and
AtvV; main predicates: Pred; coordinated main predicates: Pred_Co) is it outperformed by
other parsers, mostly by ICS-PAS. This does not match what was observed in Subsection 6.1,
where TurkuNLP did not obtain high results with respect to the other parsers. A deeper
analysis of its performance, though, reveals the main reason for this apparent discrepancy.
In fact, TurkuNLP fails to handle the terminal punctuation of sentences (dependency
relation: AuxK). While the parser assigns the correct relation to terminal punctuation, it
always fails to select the right head. Specifically, TurkuNLP scores a 0.00 LAS on terminal
punctuation, whereas DeSR performs excellently (LAS between 98.6 and 100), regardless of
the adopted configuration. ICS-PAS behaves similarly to TurkuNLP, obtaining a 0.8 LAS on
terminal punctuation.
   Not surprisingly, the main predicates of sentences (Pred) are the most easily recognised re-
lation (also when they appear in coordinated constructions: Pred_Co), as they concern nodes
that do not depend on another node but on the root of the tree (represented by a technical
node assigned the relation AuxS). Conversely, the treatment of coordinated constructions repre-
sents an issue, as usual in dependency parsing. All parsers included in CombC provide quite
low accuracy rates for coordinated dependency relations, namely coordinated attributes
(Atr_Co), objects (Obj_Co) and subjects (Sb_Co). Another tricky relation is that of verbal
attributes not participating in verb government (Atv and AtvV), which prove to be difficult
for all parsers.
   After the main predicates, attributes (Atr) are the second best-handled relation. The LAS for
adverbials (Adv), subjects (Sb) and objects (Obj) are still high and very similar to each other
(around 90.0 in CombC). Predicate nominals (Pnom), instead, seem to be more difficult to
recognise (CombC: 87.6), and their LAS shows remarkable differences between parsers, ranging
from 78.3 (DeSR (MLP, l)) to 86.6 (TurkuNLP).
   With respect to subordinating conjunctions (AuxC), prepositions (AuxP) and coordinating




nodes (Coord), no parser obtains high results, although all words that are assigned these
relations belong to closed lexical classes, which should make them easier to spot and parse.
The syntactic ambiguity of some of these words could play a role in this trend. Consider,
for instance, the following: (a) cum, which can be both a subordinating conjunction (AuxC)
and a preposition meaning 'with' (AuxP); (b) nec, which can syntactically behave like a
coordinating conjunction meaning 'and not' (Coord) or an emphasising word meaning 'not even' (AuxZ);
(c) et, which can have the syntactic function of a coordinating conjunction meaning 'and' (Coord) or of an
emphasising word meaning 'also' (AuxZ).
   The parser-based in-depth evaluation above shows the added value of combining different
tools as a means to efficiently exploit parser-specific contributions and achieve a substantial
improvement in accuracy. To give an example of the specific contribution provided by
the single parsers to their combination, Figure 1 shows the dependency trees produced by five
parsers and by their combination for the following sentence taken from the IT-TB: Ergo licet
aliquid de forma subtrahere 'Therefore, it is permitted to leave something out of the form'.6
   From top to bottom, Figure 1 lists the Gold Standard and the outputs predicted
respectively by the CombC combination (as the best performing one: see Table 6) and by the
following parsers: DeSR (MLP, l), DeSR (MLP, r), DeSR (SVM, r), ICS-PAS and TurkuNLP.
   In Figure 1, correct dependency relations and labels are displayed in green, while incorrect
ones are displayed in red. Given an ordered set of parsing outputs, the combined output is built by
selecting the value proposed by the majority of the parsers. For instance, ICS-PAS erroneously
attaches the token licet to ergo instead of the root, and assigns the AuxC relation to their
dependency. In turn, TurkuNLP correctly identifies the root of the tree as the head
node for licet, but fails in labelling the relation (assigning AuxC instead of Pred). However,
CombC succeeds in predicting both the arc and the label for the dependency in question, by
choosing the output proposed by the majority of the parsers (namely, DeSR in all its three
configurations). The same can be observed with respect to the full stop at the end of the
sentence. Even though TurkuNLP and ICS-PAS fail to attach it to the correct head (the root
of the tree), in the prediction made by CombC the terminal full stop is made dependent on the
correct head node and is assigned the right relation (AuxK), thanks to the correct prediction
of DeSR.
   Moreover, from Figure 1 we can observe how the different types of parsers concerned here
(the two neural ones vs DeSR) tend to make the same (or similar) mistakes. For instance,
terminal punctuation is attached to the wrong head by both neural parsers (ICS-PAS and
TurkuNLP), which also wrongly analyse subtrahere as an ellipsis (ExD) and licet as a subordinat-
ing conjunction (AuxC). The same errors are not made by any of the three configurations of
DeSR, which correctly analyse both licet and the terminal punctuation mark.


9. Conclusion and Future Work
In this paper we presented various experiments on automatic dependency parsing of the Index
Thomisticus Treebank. We began by replicating some of the experiments described in [23], in
order to evaluate how a larger dataset impacts parsing results. To this end, we first tested
different algorithms and settings of the parser DeSR (MLP left-to-right, MLP right-to-left,
SVM right-to-left). Then, we evaluated two combinations of the outputs of DeSR and
   6
     Scriptum super Sententiis Petri Lombardi, Liber IV, Distinctio 3, Quaestio 1, Articulus 2, Quaestiuncula
3. Translation from https://isidore.co/aquinas.




Figure 1: Gold standard and parsing predictions by five different parsers and their combination for the same
sentence from the IT-TB


other parsers (C3, C4). Results show that the larger dataset improves the accuracy rates of
parsing with respect to those reported in [23].
   We then trained and tested two recently released neural parsers, so as to assess whether and
how such an approach affects the accuracy rates. Although the two selected parsers (ICS-PAS and
TurkuNLP) had ranked highest in parsing the IT-TB at the CoNLL 2018 Shared Task on Mul-
tilingual Parsing, in our experiments they provided lower accuracy rates than the most accurate
DeSR setting (MLP, right-to-left), and substantially lower rates than the C4 combination.
   Lastly, we applied a post-processing combination technique to verify whether and to what extent
combining different types of parsers would result in higher accuracy. The




combination that joins the outputs of three DeSR settings, ICS-PAS and TurkuNLP (CombC)
resulted in a substantial enhancement of parsing performances, in terms of both LAS (+2.07)
and UAS (+1.29).7
   In the near future, we plan to build and test a new set of sentence-/token-based embeddings
for the IT-TB by using specifically defined parameters, instead of the default ones. The
experiments on parsing the IT-TB described in this paper are just one piece of the much
larger picture of dependency parsing of the Latin language. This picture features two main
variables.
   First, the high diversity of Latin texts, which are spread all over (what is today
called) Europe across a period of more than two millennia, heavily affects the diatopic and
diachronic portability of the trained models of probabilistic NLP tools.
   Second, like most of the Latin treebanks, the IT-TB is available both in its native
annotation style and in the UD one, which is nowadays a de facto standard in syntactic
(dependency) annotation.
   As for the former, although the results of the work presented in this paper are very promising
for the specific needs of the IT-TB project, they must be interpreted carefully when talking about
Latin parsing in general terms. Indeed, in the near future it will be necessary to test
domain-adaptation techniques for the available trained models, in order to limit the decrease
in accuracy when models are applied to texts of a different era and/or genre than those of
their training set.
   As for the latter, there are several initiatives in support of parsing the UD treebanks, such as
the UDPipe tool8 [26] and the various shared tasks on UD parsing at international conferences
like CoNLL and IWPT.9
   Finally, in the coming years, one edition of the EvaLatin evaluation campaign of NLP
tools for Latin will include a task specifically devoted to syntactic dependency parsing.10


Acknowledgments
This project has received funding from the European Research Council (ERC) under the Euro-
pean Union’s Horizon 2020 research and innovation programme - Grant Agreement No 769994.


References
 [1]   G. Attardi. “Experiments with a Multilanguage Non-Projective Dependency Parser”.
       In: Proceedings of the Tenth Conference on Computational Natural Language Learning
       (CoNLL-X). 2006, pp. 166–170.



     7
        All models, datasets, outputs and scripts that we used to perform the experiments described in this
paper, or that result from them, are openly available at https://github.com/CIRCSE/IT-TB_Parsing.
     8
       https://ufal.mff.cuni.cz/udpipe.
     9
       See the webpage of the UD-related events at https://universaldependencies.org/events.html. The best
performing system on the IT-TB at the CoNLL 2018 Shared Task (HIT-SCIR [11]) provided a LAS of 87.08.
The results of the competition are available at http://universaldependencies.org/conll18/results-las.html.
    10
       Information on the first edition of EvaLatin, dedicated to lemmatisation and Part-of-Speech tagging, can be
found at https://circse.github.io/LT4HALA/EvaLatin. An overview of the results of the evaluation campaign
is provided by [25].




 [2]   D. Bamman and G. Crane. “The Latin Dependency Treebank in a Cultural Heritage
       Digital Library”. In: Proceedings of the Workshop on Language Technology for Cultural
       Heritage Data (LaTeCH 2007). 2007, pp. 33–40.
 [3]   D. Bamman, M. Passarotti, R. Busa, and G. Crane. “The Annotation Guidelines of the
       Latin Dependency Treebank and Index Thomisticus Treebank: the Treatment of some
       specific Syntactic Constructions in Latin”. In: Proceedings of the Sixth International
       Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco:
       European Language Resources Association (ELRA), 2008.
 [4]   B. Bohnet. “Very High Accuracy and Fast Dependency Parsing is not a Contradiction”.
       In: Proceedings of the 23rd International Conference on Computational Linguistics (Col-
       ing 2010). Beijing, China, 2010, pp. 89–97.
 [5]   B. Bohnet, J. Nivre, I. Boguslavsky, R. Farkas, F. Ginter, and J. Hajič. “Joint Morpho-
       logical and Syntactic Analysis for Richly Inflected Languages”. In: Transactions of the
       Association for Computational Linguistics 1 (2013), pp. 415–428. doi: 10.1162/tacl_a_00238.
 [6]   P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. “Enriching Word Vectors with
       Subword Information”. In: Transactions of the Association for Computational Linguistics
       5 (2017), pp. 135–146.
 [7]   S. Buchholz and E. Marsi. “CoNLL-X Shared Task on Multilingual Dependency Parsing”.
       In: Proceedings of the Tenth Conference on Computational Natural Language Learning
       (CoNLL-X). New York City, 2006, pp. 149–164.
 [8]   R. Busa. Index Thomisticus: Sancti Thomae Aquinatis Operum Omnium Indices Et
       Concordantiae in Quibus Verborum Omnium Et Singulorum Formae Et Lemmata Cum
       Suis Frequentiis Et Contextibus Variis Modis Referuntur. 1974.
 [9]   F. M. Cecchini, R. Sprugnoli, G. Moretti, and M. Passarotti. “UDante: First Steps To-
       wards the Universal Dependencies Treebank of Dante’s Latin Works”. In: Seventh Italian
       Conference on Computational Linguistics. CEUR Workshop Proceedings. 2020, pp. 1–7.
[10]   F. M. Cecchini, T. Korkiakangas, and M. Passarotti. “A New Latin Treebank for Univer-
       sal Dependencies: Charters between Ancient Latin and Romance Languages”. In: Pro-
       ceedings of the 12th Language Resources and Evaluation Conference. Marseille, France:
       European Language Resources Association, 2020.
[11]   W. Che, Y. Liu, Y. Wang, B. Zheng, and T. Liu. “Towards Better UD Parsing: Deep
       Contextualized Word Embeddings, Ensemble, and Treebank Concatenation”. In: Proceed-
       ings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal
       Dependencies. Brussels, Belgium, 2018, pp. 55–64.
[12]   T. Dozat and C. D. Manning. “Deep Biaffine Attention for Neural Dependency Parsing”.
       In: arXiv preprint arXiv:1611.01734 (2016).
[13]   T. Dozat, P. Qi, and C. D. Manning. “Stanford’s Graph-based Neural Dependency Parser
       at the CoNLL 2017 Shared Task”. In: Proceedings of the CoNLL 2017 Shared Task:
       Multilingual Parsing from Raw Text to Universal Dependencies. Vancouver, Canada,
       2017. doi: 10.18653/v1/K17-3002.




[14]   E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov. “Learning Word Vectors
       for 157 Languages”. In: Proceedings of the Eleventh International Conference on Language
       Resources and Evaluation (LREC 2018). Miyazaki, Japan: European Language Resources
       Association (ELRA), 2018.
[15]   D. T. Haug and M. Jøhndal. “Creating a Parallel Treebank of the Old Indo-European
       Bible Translations”. In: Proceedings of the Second Workshop on Language Technology for
       Cultural Heritage Data (LaTeCH 2008). 2008, pp. 27–34.
[16]   J. Kanerva, F. Ginter, N. Miekka, A. Leino, and T. Salakoski. “Turku Neural Parser
       Pipeline: An End-to-End System for the CoNLL 2018 Shared Task”. In: Proceedings of
       the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Depen-
       dencies. Brussels, Belgium, 2018, pp. 133–142.
[17]   G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush. “OpenNMT: Open-Source Toolkit
       for Neural Machine Translation”. In: Proceedings of ACL 2017, System Demonstrations.
       Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 67–72.
[18]   J. Nilsson and J. Nivre. “MaltEval: an Evaluation and Visualization Tool for Dependency
       Parsing”. In: Proceedings of the Sixth International Conference on Language Resources
       and Evaluation (LREC’08). Marrakech, Morocco: European Language Resources Asso-
       ciation (ELRA), 2008.
[19]   J. Nivre, M.-C. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. Mc-
       Donald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman. “Universal De-
       pendencies v1: A Multilingual Treebank Collection”. In: Proceedings of the Tenth In-
       ternational Conference on Language Resources and Evaluation (LREC’16). Portorož,
       Slovenia: European Language Resources Association (ELRA), 2016, pp. 1659–1666. url:
       https://universaldependencies.org.
[20]   M. Passarotti. “The Project of the Index Thomisticus Treebank”. In: Digital Classical
       Philology 10 (2019), pp. 299–320. doi: 10.1515/9783110599572-017.
[21]   M. Passarotti and F. Dell’Orletta. “Improvements in Parsing the Index Thomisticus
       Treebank. Revision, Combination and a Feature Model for Medieval Latin”. In: Proceed-
       ings of the Seventh International Conference on Language Resources and Evaluation
       (LREC’10). Valletta, Malta: European Language Resources Association (ELRA), 2010.
[22]   M. Passarotti and P. Ruffolo. “Parsing the Index Thomisticus Treebank. Some Pre-
       liminary Results”. In: 15th International Colloquium on Latin Linguistics. Innsbrucker
       Beiträge zur Sprachwissenschaft. 2010, pp. 714–725.
[23]   E. M. Ponti and M. Passarotti. “Differentia compositionem facit. A Slower-paced and
       Reliable Parser for Latin”. In: Proceedings of the Tenth International Conference on
       Language Resources and Evaluation (LREC’16). 2016, pp. 683–688.
[24]   P. Rybak and A. Wróblewska. “Semi-Supervised Neural System for Tagging, Parsing and
       Lematization”. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from
       Raw Text to Universal Dependencies. Brussels, Belgium, 2018, pp. 45–54.
[25]   R. Sprugnoli, M. Passarotti, F. M. Cecchini, and M. Pellegrini. “Overview of the EvaLatin
       2020 Evaluation Campaign”. In: Proceedings of LT4HALA 2020 - 1st Workshop on Lan-
       guage Technologies for Historical and Ancient Languages. Marseille, France: European
       Language Resources Association (ELRA), 2020, pp. 105–110.




[26]   M. Straka and J. Straková. “Tokenizing, POS Tagging, Lemmatizing and Parsing UD
       2.0 with UDPipe”. In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing
       from Raw Text to Universal Dependencies. Vancouver, Canada, 2017, pp. 88–99.
[27]   M. Surdeanu and C. D. Manning. “Ensemble Models for Dependency Parsing: Cheap
       and Good?” In: Human Language Technologies: The 2010 Annual Conference of the
       North American Chapter of the Association for Computational Linguistics. Los Angeles,
       California, 2010, pp. 649–652.
[28]   D. Zeman, J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S.
       Petrov. “CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal
       Dependencies”. In: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing
       from Raw Text to Universal Dependencies. Brussels, Belgium, 2018, pp. 1–21. doi: 10.
       18653/v1/K18-2001.
[29]   D. Zeman, M. Popel, M. Straka, J. Hajic, J. Nivre, F. Ginter, J. Luotolahti, S. Pyysalo, S.
       Petrov, M. Potthast, F. Tyers, E. Badmaeva, M. Gokirmak, A. Nedoluzhko, S. Cinkova,
       J. Hajic jr., J. Hlavacova, V. Kettnerová, Z. Uresova, J. Kanerva, S. Ojala, A. Missilä,
       C. D. Manning, S. Schuster, S. Reddy, D. Taji, N. Habash, H. Leung, M.-C. de Marneffe,
       M. Sanguinetti, M. Simi, H. Kanayama, V. dePaiva, K. Droganova, H. Martínez Alonso,
       Ç. Çöltekin, U. Sulubacak, H. Uszkoreit, V. Macketanz, A. Burchardt, K. Harris, K.
       Marheinecke, G. Rehm, T. Kayadelen, M. Attia, A. Elkahky, Z. Yu, E. Pitler, S. Lert-
       pradit, M. Mandl, J. Kirchner, H. F. Alcalde, J. Strnadová, E. Banerjee, R. Manurung,
       A. Stella, A. Shimada, S. Kwak, G. Mendonca, T. Lando, R. Nitisaroj, and J. Li. “CoNLL
       2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies”. In:
       Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to
       Universal Dependencies. Vancouver, Canada, 2017, pp. 1–19.



