                     Tree Kernels-based Discriminative Reranker
                           for Italian Constituency Parsers

                          Antonio Uva† and Alessandro Moschitti
                   † DISI, University of Trento, 38123 Povo (TN), Italy
                 Qatar Computing Research Institute, HBKU, 5825 Doha, Qatar
                    antonio.uva@unitn.it   amoschitti@gmail.com


                    Abstract

English. This paper aims at filling the gap between the accuracy of Italian and English constituency parsing: firstly, we adapt the Bllip parser, i.e., the most accurate constituency parser for English, also known as the Charniak parser, to Italian and train it on the Turin University Treebank (TUT). Secondly, we design a parse reranker based on Support Vector Machines using tree kernels, where the latter can effectively generalize syntactic patterns while requiring little data for training the model. We show that our approach outperforms the state of the art achieved by the Berkeley parser, improving it from 84.54 to 86.81 in labeled F1.

Italiano. This paper aims at closing the accuracy gap between Italian and English constituency parsing: as a first improvement, we adapted the English constituency parser Bllip, also known as the Charniak parser, to Italian and trained it on the Turin University Treebank. We then designed a reranker based on Support Vector Machines with tree kernels, which can effectively generalize syntactic patterns while requiring little data to train the model. Our approach outperforms the state of the art obtained with the Berkeley parser, improving labeled F1 from 84.54 to 86.81.
1   Introduction

Constituency syntactic parsing is one of the most important research lines in Computational Linguistics. Consequently, a large body of work has also been devoted to its design for the Italian language (Bosco et al., 2007; Bosco et al., 2009; Bosco and Mazzei, 2011). However, the accuracy reported for the best parser is still far behind the state of the art for other languages, e.g., English.

One noticeable attempt to fill this technological gap was carried out in the EvalIta challenge, which proposed a parsing track on both dependency and constituency parsing for Italian. Among the several participant systems, the Berkeley parser (Petrov and Klein, 2007) gave the best result (Lavelli and Corazza, 2009; Lavelli, 2011).

At the beginning, the results for constituency parsing computed on TUT (Bosco et al., 2009) were much lower than those obtained for English on the Penn Treebank (Marcus et al., 1993). In the last EvalIta edition, this gap diminished, as the Italian parser labeled F1 increased from 78.73% (EvalIta 2009) to 82.96% (EvalIta 2011). Some years later, the parser F1 improved to 83.27% (Bosco et al., 2013). However, the performance of the best English parser (McClosky et al., 2006), i.e., 92.1%, is still far away. The main reason for this gap is the difference in the amount of training data available for Italian compared to English: while the Penn Treebank contains 49,191 sentences/trees, TUT only contains 3,542.

When training data is scarce, a general solution for increasing the accuracy of a machine learning-based system is the use of more general features. This way, the probability of matching training and test instance representations is larger, allowing the learning process to find more accurate optima. In the case of syntactic parsing, we need to generalize lexical or syntactic features, or possibly both. However, modeling such generalization in state-of-the-art parsing algorithms such as Bllip (https://github.com/BLLIP/bllip-parser) (Charniak, 2000; Charniak and Johnson, 2005) is rather challenging. In particular, the space of all possible syntactic patterns is very large and cannot be explicitly coded in the model.
An easier solution consists in using such features in a simpler model, which can be trained to improve the output of the main parser, e.g., by selecting one of its best hypotheses. In particular, tree kernels (TKs) by Moschitti (2006) can be used for encoding an exponential number of syntactic patterns in parse rerankers.

In this work, we aim at filling the gap between English and Italian constituency parsing. Firstly, we adapted the Bllip parser, i.e., the most accurate constituency parser for English, also known as the Charniak parser, to Italian and trained it on TUT. We designed various configuration files defining the specific labels of TUT, along with their types, although we did not encode head-finding rules for Italian, which are needed to complete the parser adaptation.

Secondly, we apply rerankers based on Support Vector Machines (SVMs) using TKs to the k-best parses produced by Bllip, with the aim of selecting its best hypotheses. TKs allow us to represent data using the entire space of subtrees, which correspond to syntactic patterns of different levels of generality. This representation enables the training of the reranker with little data. Finally, we tested our models on TUT, following the EvalIta setting, and compared them with other parsers. For example, we observed an improvement of about 2% over the Berkeley parser, i.e., 86.81 vs. 84.54.
2   Bllip parser

The Bllip parser is a lexicalized probabilistic constituency parser. It can be considered a smoothed PCFG, whose non-terminals encode a wide variety of manually chosen conditioning information, such as heads, governors, etc. Such information is used to derive probability distributions, which, in turn, are utilized for computing the likelihood of the constituency trees being generated. As described by McClosky et al. (2006), Bllip uses five distributions, i.e., the probabilities of the (i) constituent heads, (ii) constituent part-of-speech tags (PoS), (iii) head-constituents, (iv) left-of-head and (v) right-of-head constituents. Each probability distribution is conditioned on five or more features and backed off by the probability of lower-order models in case of rare feature configurations. The variety of information needed by Bllip to work properly makes its configuration much harder than for other parsers, e.g., the Berkeley one. However, Bllip is faster to train than other off-the-shelf parsers.
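To make the backoff mechanism concrete, the following minimal Python sketch interpolates a conditional estimate with a lower-order one when a feature configuration is rare. The interpolation scheme and its constant are our own illustrative assumptions, not Bllip internals.

    from collections import Counter, defaultdict

    # Illustrative backed-off conditional estimate; the interpolation
    # weight lam and the constant 5.0 are assumptions for exposition,
    # not the actual Bllip smoothing formulas.
    class BackedOffDistribution:
        def __init__(self):
            self.full = defaultdict(Counter)     # counts[context][outcome]
            self.reduced = defaultdict(Counter)  # lower-order counts

        def observe(self, outcome, context):
            self.full[context][outcome] += 1
            # Lower-order model: drop the least important feature
            self.reduced[context[:-1]][outcome] += 1

        def prob(self, outcome, context):
            low = self.reduced[context[:-1]]
            m = sum(low.values())
            p_low = low[outcome] / m if m else 0.0
            high = self.full[context]
            n = sum(high.values())
            if n == 0:
                return p_low                     # unseen configuration: back off
            lam = n / (n + 5.0)                  # trust full context as evidence grows
            return lam * high[outcome] / n + (1.0 - lam) * p_low

    # A rare context, e.g., P(head | label, parent, governor), falls back
    # to the estimate conditioned on the reduced context (label, parent).
    d = BackedOffDistribution()
    d.observe("casa", ("NP", "S", "vede"))
    print(d.prob("casa", ("NP", "S", "canta")))  # 1.0, from the backoff model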
2.1   Adapting Bllip to the Italian Language

Adapting Bllip required creating various configuration files. For example, the PoS and bracket labels observed in the training and development sets must be defined in a file named terms.txt. As the labels present in TUT are different from those of the Penn Treebank (for example, the PoS-tag NN in the Penn Treebank corresponds to the tag NOU~CS in TUT), we added them to this file. Then, we specified the type of each label present in the data, i.e., constituent type, open-class PoS, punctuation, etc.

Finally, it should be noted that, since Bllip is lexicalized, head-finding rules for Italian should be specified in the file headInfo.txt. For example, the rule ADJP ->r JJ specifies that the head of an adjective phrase (ADJP) is the right-most adjective (JJ). Due to time restrictions, we used the default Bllip rules and leave this task as short-term future work. A sketch of how such rules select heads is given below.
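As a minimal illustration of how rules of this kind select heads (the rule table and function below are our own simplified assumptions, not Bllip's actual headInfo.txt format):

    # Simplified head-finding sketch; the rule representation is an
    # assumption for illustration and does not mirror headInfo.txt syntax.
    # Each rule: parent label -> (search direction, acceptable child labels).
    HEAD_RULES = {
        "ADJP": ("right", ["JJ"]),            # right-most adjective heads ADJP
        "NP":   ("right", ["NOUN", "NN"]),
        "VP":   ("left",  ["VB", "VERB"]),
    }

    def find_head(parent_label, children_labels):
        """Return the index of the head child under the matching rule."""
        direction, candidates = HEAD_RULES.get(parent_label, ("left", []))
        order = range(len(children_labels))
        if direction == "right":
            order = reversed(list(order))
        for i in order:
            if children_labels[i] in candidates:
                return i
        # Fallback when no candidate matches: take the boundary child
        return 0 if direction == "left" else len(children_labels) - 1

    # The rule ADJP ->r JJ picks the right-most JJ as the head:
    assert find_head("ADJP", ["RB", "JJ", "JJ"]) == 2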
3   Tree Kernel-based Reranker

We describe three types of TKs and the Preference Reranker approach using them.

3.1   Tree kernels

TKs can be used for representing arbitrary tree structures in kernel machines, e.g., SVMs. They are a viable alternative to explicit feature design as they implement the scalar product between feature vectors as a similarity between two trees. Such a scalar product is computed using efficient algorithms and is basically equal to the number of common subparts of the two trees.

The Syntactic Tree Kernel (STK) counts the number of common tree fragments, where the latter (i) contain more than two nodes and (ii) each node is connected to either all or none of its children. We also used a variant, called STKb, which adds the number of common leaves of the compared trees to the final subpart count.

The Partial Tree Kernel (PTK) counts a larger class of tree fragments, i.e., any subset of nodes that are connected in the original trees: clearly, PTK is a generalization of STK. A sketch of the fragment-counting computation is given below.
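To give a concrete picture of what STK computes, here is a minimal Python sketch of the classic subtree-counting recursion in the style of Collins and Duffy (2002); the Tree class is ours, and the decay factor normally used in practice is omitted for simplicity.

    # Minimal STK sketch: K(T1, T2) sums, over all node pairs, the number
    # of common fragments in which every node keeps all of its children
    # or none (Collins-Duffy style recursion, no decay factor).
    class Tree:
        def __init__(self, label, children=()):
            self.label, self.children = label, list(children)

    def delta(n1, n2):
        """Number of common fragments rooted at n1 and n2."""
        if n1.label != n2.label or not n1.children or not n2.children:
            return 0                               # bare leaves root no fragment
        if [c.label for c in n1.children] != [c.label for c in n2.children]:
            return 0                               # different production rules
        prod = 1
        for c1, c2 in zip(n1.children, n2.children):
            prod *= 1 + delta(c1, c2)              # child kept as leaf, or expanded
        return prod

    def nodes(t):
        yield t
        for c in t.children:
            yield from nodes(c)

    def stk(t1, t2):
        return sum(delta(a, b) for a in nodes(t1) for b in nodes(t2))

    # Example: two identical NP subtrees share 4 fragments rooted at NP,
    # plus one rooted at DT and one at NN.
    np1 = Tree("NP", [Tree("DT", [Tree("il")]), Tree("NN", [Tree("cane")])])
    print(stk(np1, np1))  # 4 + 1 + 1 = 6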
                                                   sentences ≤ 40 words                All sentences
              Models                           LR       LP      LF     EMR      LR      LP       LF      EMR
             Berkeley (Bosco et al., 2013)   83.45 84.48 83.96 24.91          82.78   83.76 83.27       23.67
             Berkeley (our model)            85.31 85.76 85.53 27.76          84.35   84.72 84.54       26.33
             Bllip base model                85.90 86.67 86.28 29.54          85.26   85.97 85.61       28.00
             STK                             86.16 87.02 86.59 30.96          85.73   86.38 86.05       29.33
             STKb                            86.36 87.21 86.78 31.67          85.89   86.53 86.21       30.00
             PTK                             86.82 87.95 87.38 30.96          86.33   87.29 86.81       29.67

Table 1: Comparative results on the test set. LR/LP/LF = labeled recall/precision/F1. EMR = percentage of sentences where recall and precision are 100%. The STK- and STKb-based rerankers use 20-best hypotheses, while the PTK-based reranker uses 30-best hypotheses.

3.2   Preference Reranker

Preference reranking is cast as a binary classification problem, where each instance is a pair ⟨h_i, h_j⟩ of tree hypotheses and the classifier decides whether h_i is better than h_j. The positive training examples are the pairs ⟨h_1, h_i⟩, where h_1 has the highest F1 with respect to the gold standard among the candidate hypotheses. The negative examples are obtained by inverting the hypotheses in the pairs, i.e., ⟨h_i, h_1⟩. If the two hypotheses of a pair have the same score, the pair is not included in the training set. At classification time, all pairs ⟨h_i, h_j⟩ generated from the k-best hypotheses are classified. A positive classification is a vote for h_i, whereas a negative classification is a vote for h_j. The hypothesis associated with the highest number of votes (or the highest sum of classifier scores) is selected as the best parse.
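The pair generation and voting scheme just described can be sketched as follows; classify stands in for the trained SVM with tree kernels and is a hypothetical placeholder, not the actual SVM-light-TK interface.

    from itertools import combinations

    # Sketch of preference-pair generation and voting. classify(hi, hj)
    # is a stand-in for the trained SVM: a positive score means
    # "hi is better than hj".
    def make_training_pairs(hypotheses, f1_scores):
        """hypotheses: k-best trees; f1_scores: F1 of each w.r.t. the gold tree."""
        best = max(range(len(hypotheses)), key=lambda i: f1_scores[i])
        pairs = []
        for i, h in enumerate(hypotheses):
            if i == best or f1_scores[i] == f1_scores[best]:
                continue  # skip pairs whose members have the same score
            pairs.append(((hypotheses[best], h), +1))  # <h1, hi> is positive
            pairs.append(((h, hypotheses[best]), -1))  # inverted pair is negative
        return pairs

    def select_best(hypotheses, classify):
        """Vote over all pairs from the k-best list and return the winner."""
        votes = [0.0] * len(hypotheses)
        for i, j in combinations(range(len(hypotheses)), 2):
            score = classify(hypotheses[i], hypotheses[j])
            votes[i if score > 0 else j] += abs(score)  # sum of classifier scores
        return hypotheses[max(range(len(hypotheses)), key=lambda i: votes[i])]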
4   Experiments

In these experiments, we first report on the performance of Bllip for Italian and compare it with the Berkeley parser. Then, we show that our parse reranker can be very effective, even when little training data is used.

4.1   Experimental Setup

Parsing data. The data for training and testing the constituency parsers come from the TUT project (http://www.di.unito.it/~tutreeb/), developed at the University of Turin. There have been several releases of the dataset: we used the latest version, from EvalIta 2011. The training set is composed of 3,542 sentences, while the test set contains 300 sentences. The set of PoS-tags includes 97 tags: 68 encoding morphological features (out of which 19 are basic tags) for pre-terminal symbols (e.g., ADJ, ADVB, NOUN, etc.) and 29 non-terminal symbols for phrase constituents (e.g., ADJP, ADVP, NP, etc.).
Reranking Data. To generate the data for training the reranker, we apply 10-fold cross-validation to the official TUT training set: we train the base parser on 9 folds and apply it to the remaining fold to generate the n-best trees for each of its sentences. Then, we merge all the 10 labeled folds to produce the training set of the reranker. This way, we avoid the bias a parser would have if applied to the data used for training it. For generating the test data of the reranker, we simply apply the base parser (trained on all TUT training data) to the TUT test set and generate n hypotheses for each sentence.

SVM Reranker. We train the reranker using SVM-light-TK, which takes both feature vectors and trees as input to learn a classification model. The features used for reranking constituency trees are: (i) the probability and the (inverse) rank of the hypotheses provided by Bllip and (ii) the entire syntactic trees used with two types of kernels, STK and PTK, described in Sec. 3.

Measures. For evaluating the parsers, we used the EVALB scoring program, which reports Labeled Precision (LP), Labeled Recall (LR), Labeled F1 (LF) and Exact Match Rate (EMR). Following the official EvalIta procedure for evaluating participant system output, we did not score the TOP label, ignored all functional labels attached to non-terminals, and included punctuation in the scoring procedure. A sketch of these metrics is given below.
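As a reference for how these metrics relate, the following sketch scores a single sentence by comparing labeled bracket (span) sets; it is a simplified illustration of EVALB-style scoring, not the EVALB program itself.

    # Minimal labeled-bracket scoring sketch (simplified EVALB-style logic).
    # A parse is represented as a multiset of labeled spans: (label, i, j).
    from collections import Counter

    def score(gold_brackets, test_brackets):
        """Return (LP, LR, LF, exact_match) for one sentence. Corpus-level
        EMR is the percentage of sentences with exact_match == True."""
        gold, test = Counter(gold_brackets), Counter(test_brackets)
        matched = sum((gold & test).values())  # multiset intersection
        lp = matched / sum(test.values()) if test else 0.0
        lr = matched / sum(gold.values()) if gold else 0.0
        lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
        return lp, lr, lf, gold == test        # exact: P and R are both 100%

    # Example: one wrong label lowers both LP and LR and breaks exact match.
    gold = [("NP", 0, 2), ("VP", 2, 5), ("S", 0, 5)]
    test = [("NP", 0, 2), ("ADJP", 2, 5), ("S", 0, 5)]
    print(score(gold, test))  # LP = LR = 2/3, exact match = False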
4.2   Bllip base parser results

We divided the training set into train and validation sets, where the latter is composed of the last 50 sentences of each of the six sections of the former, for a total of 300 sentences. We trained the models on the training set and tuned the parameters on the validation set. Then, we applied the learned model to the 300 sentences of the test set. Table 1 shows the results obtained by the Bllip base parser on the TUT test set. Our parser obtained an LF of 86.28% for sentences with fewer than 40 words and a score of 85.61% for all sentences.

4.3   Comparison with the Berkeley parser

Table 1 also reports the results of the Berkeley parser obtained by Bosco et al. (2013). For comparison purposes, we trained our own version of the Berkeley parser. In particular, we trained the parser for 5 split-merge cycles on the whole training set; we selected this number of cycles by applying 10-fold cross-validation on the training set. Similarly to Bosco et al. (2013), we specialized punctuation symbols to more specific tags. However, we used full PoS-tags, as they gave the best results in cross-fold validation. Indeed, Table 1 shows that our own version of the Berkeley parser outperforms the one of Bosco et al. (2013) by 1.27 absolute percent points (84.54 vs. 83.27). The table also reports the results of the Bllip parser, which outperforms the best result obtained by the Berkeley parser by 1.07% in LF, i.e., 85.61 vs. 84.54.

                          10-best                   20-best                   30-best
 Models              Tree        Tree + feat.   Tree        Tree + feat.   Tree        Tree + feat.
                  len≤40  All   len≤40  All    len≤40  All   len≤40  All    len≤40  All   len≤40  All
 Bllip base model  86.28 85.61   86.28 85.61    86.28 85.61   86.28 85.61    86.28 85.61   86.28 85.61
 STK               84.95 84.49   86.31 85.69    84.70 84.16   86.59 86.05    84.99 84.45   86.55 86.00
 STKb              85.05 84.52   86.31 85.69    84.92 84.35   86.78 86.21    84.92 84.38   86.62 86.06
 PTK               86.02 85.46   87.34 86.65    85.89 86.41   87.37 86.79    86.42 85.92   87.38 86.81

Table 2: Reranker performance: the first header row reports the number n of best hypotheses used during training, the second row shows the feature group used (Tree or Tree + feat.), and the third row splits the parse results (LF) into two sentence groups: sentences with ≤ 40 words (len≤40) and all sentences (All).

4.4   Reranking using different TKs

Table 2 reports the LF obtained by different reranking models, varying: (i) the type of TK, (ii) the group of features (i.e., either trees or trees + feat.) and (iii) the number, n, of parse trees used to generate the reranker training data. More specifically, we experimented with three values of n, i.e., the 10-, 20- and 30-best parse trees. As can be seen from the table, PTK constantly outperforms STK and STKb for any number of parse hypotheses. This indicates that the subtree features generated by PTK, which include nodes with any subset of the children in the original tree, are useful for improving parser accuracy.

Very interestingly, all models trained on the 30-best trees give either worse results (e.g., STKb and STK) or very little improvement (e.g., PTK) compared with training on the 20-best parse trees. This may suggest that adding too many negative examples, largely populating the lower part of the n-best list, may be detrimental.

The bottom part of Table 1 shows standard parser evaluation metrics for different reranking models using different kernel types: only the kernel models with the highest LF from Table 2 are reported. The method shows a 1.2% absolute improvement in LF (from 85.61% to 86.81%) on all the sentences over the base-parser model (i.e., the baseline) when using the most powerful kernel, PTK, and the 30-best hypotheses. STK and STKb show lower improvements over the baseline of 0.44% and 0.6%, respectively. One interesting fact is the following: while PTK gives better results in terms of LF, STK and STKb perform better in terms of EMR, i.e., the percentage of sentence parses completely matching the gold trees. This is rather intuitive: as the name suggests, the Partial Tree Kernel generates partial subtrees, i.e., partial production rules, as patterns. On the one hand, this can improve the ability to match syntactic patterns, thus capturing rules partially expressed by more than one support vector. On the other hand, the precision in capturing complete patterns, i.e., regarding a complete tree, intuitively decreases.

5   Related Work and Conclusions

This work was inspired by Collins and Duffy (2002) and Collins and Koo (2005), who explored discriminative approaches for ranking problems. Their studies were limited to the WSJ, though, and did not explore the use of max-margin classifiers, i.e., SVMs. The first experiments with SVMs and TKs were conducted by Shen and Joshi (2003), who proposed a new SVM-based voting algorithm making use of preference reranking.

In this paper, we adapted the Charniak parser to Italian, gaining an improvement of 1.07% over the Berkeley model (indicated by EvalIta as the state of the art for Italian). Then, our TK-based reranker further improved this result by up to 2 absolute percent points. It should also be noted that our best reranking result is 3.54 absolute points better than the best outcome reported in (Bosco et al., 2013), i.e., 83.27.

In the future, we would like to (i) integrate the features developed in the reranking software made available by Johnson and Ural (2010) into our model for further improvement, and (ii) generalize lexical features (e.g., embeddings, Brown clusters) and include similarity measures in PTK, i.e., SPTK (Croce et al., 2011).

Acknowledgments

A special thank-you is due to Alberto Lavelli and Alessandro Mazzei for enabling us to carry out an exact comparison with their parser.
References

[Bosco and Mazzei2011] Cristina Bosco and Alessandro Mazzei. 2011. The EVALITA 2011 parsing task: the constituency track. In Working Notes of EVALITA.

[Bosco et al.2007] Cristina Bosco, Alessandro Mazzei, and Vincenzo Lombardo. 2007. EVALITA parsing task: an analysis of the first parsing system contest for Italian. Intelligenza Artificiale, 12:30–33.

[Bosco et al.2009] Cristina Bosco, Alessandro Mazzei, and Vincenzo Lombardo. 2009. EVALITA'09 parsing task: constituency parsers and the Penn format for Italian. Proceedings of EVALITA, 9:1794–1801.

[Bosco et al.2013] Cristina Bosco, Alessandro Mazzei, and Alberto Lavelli. 2013. Looking back to the EVALITA constituency parsing task: 2007–2011. In Evaluation of Natural Language and Speech Tools for Italian, pages 46–57. Springer.

[Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 173–180.

[Charniak2000] Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 132–139.

[Collins and Duffy2002] Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: kernels over discrete structures, and the voted perceptron. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 263–270.

[Collins and Koo2005] Michael Collins and Terry Koo. 2005. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–70.

[Croce et al.2011] Danilo Croce, Alessandro Moschitti, and Roberto Basili. 2011. Structured lexical similarity via convolution kernels on dependency trees. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1034–1046.

[Johnson and Ural2010] Mark Johnson and Ahmet Engin Ural. 2010. Reranking the Berkeley and Brown parsers. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 665–668.

[Lavelli and Corazza2009] Alberto Lavelli and Anna Corazza. 2009. The Berkeley parser at the EVALITA 2009 constituency parsing task. In EVALITA 2009 Workshop on Evaluation of NLP Tools for Italian.

[Lavelli2011] Alberto Lavelli. 2011. The Berkeley parser at the EVALITA 2011 constituency parsing task. In Working Notes of EVALITA.

[Marcus et al.1993] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330.

[McClosky et al.2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 152–159.

[Moschitti2006] Alessandro Moschitti. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In European Conference on Machine Learning, pages 318–329. Springer.

[Petrov and Klein2007] Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In HLT-NAACL, volume 7, pages 404–411.

[Shen and Joshi2003] Libin Shen and Aravind K. Joshi. 2003. An SVM-based voting algorithm with application to parse reranking. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 9–16.