<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>(Almost) Automatic Conversion of the Venice Italian Treebank into the Merged Italian Dependency Treebank Format</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Linda Alfieri</string-name>
          <email>lindalfieri1988@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <email>fabio.tamburini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT, University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>English. This paper describes the automatic procedure we developed to convert an Italian dependency treebank into a different format. We defined about 4,250 formal rules for rewriting dependencies and token tags as well as an algorithm for treebank rewriting able to avoid rule interference. At the end of this process a large portion of the whole treebank was automatically converted, with very few errors, leaving only a small amount of work to be done manually.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The availability of large annotated language
resources is a prerequisite for the development of
reliable automatic annotation tools using machine
learning techniques.</p>
      <p>Automatic tools able to enrich real texts with
sentence syntactic structures are central
instruments in Natural Language Processing (NLP)
pipelines for a reliable annotation of text corpora.
Modern NLP parsers depend heavily on complex
training phases performed on manually
annotated treebanks. Data sparsity, especially for
low-resourced languages, seriously affects parser
performance, forcing scholars to annotate more
and more data.</p>
      <p>
        Until 2012 the state of the art for Italian
treebanks was not satisfactory: three different
projects and institutions had produced three treebanks
using different background theories, different
formats and also different syntactic structures. They
were the Italian Syntactic Semantic Treebank
- ISST
        <xref ref-type="bibr" rid="ref12">(Montemagni and Simi, 2007)</xref>
        , the Turin
University Treebank - TUT
        <xref ref-type="bibr" rid="ref3">(Bosco et al., 2000)</xref>
        and the Venice Italian Treebank - VIT
        <xref ref-type="bibr" rid="ref8">(Delmonte
et al., 2007)</xref>
        . Table 1 outlines the main
characteristics of these treebanks at that time.
      </p>
      <p>Table 1: Main characteristics of the three treebanks. Rows: ISST, TUT, VIT. Columns: approximate size (tokens, sentences) and type.</p>
      <sec id="sec-1-1">
        <p>
          ISST and TUT were used as gold standards in
various evaluation campaigns
          <xref ref-type="bibr" rid="ref10 ref12 ref8">(CoNLL2007 and
EVALITA series)</xref>
          , but only in 2012 did the research
groups developing these treebanks start to
integrate them into a single resource. In 2012 the
Merged Italian Dependency Treebank - MIDT
was created and released by merging the two
resources
          <xref ref-type="bibr" rid="ref4">(Bosco et al., 2012)</xref>
          and in the
following years the project developed this resource further,
incorporating it into the larger Universal Dependencies - UD
project
          <xref ref-type="bibr" rid="ref1 ref13">(Nivre, 2015; Attardi et al., 2015)</xref>
          , through
another intermediate step, the Italian Stanford
Dependency Treebank - ISDT
          <xref ref-type="bibr" rid="ref5">(Bosco et al., 2013)</xref>
          .
During this process some other annotated texts
were added to the treebank, raising its size
to around 315,000 tokens and 12,700 sentences
(UD Italian, v1.3).
        </p>
        <p>This paper describes the latest step in the
Italian treebank merging effort: the conversion,
harmonisation and integration of the written sections of
VIT, not previously included in ISST, with the
other two resources, to reach a total
of about 600,000 tokens and 23,000 syntactically
annotated sentences. For practical reasons we
decided to convert VIT into the MIDT format and
then to use the existing automatic
procedures and checking programs to transform it into
the final UD format.</p>
        <p>
          There are other notable works on treebank
conversion in various languages; for Italian, for example, see
          <xref ref-type="bibr" rid="ref2">(Bos et al., 2009)</xref>
          .
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 The Venice Italian Treebank</title>
      <p>
        The Venice Italian Treebank was created by the
Laboratory of Computational Linguistics of the
Department of Language Sciences, University of
Venice
        <xref ref-type="bibr" rid="ref8">(Delmonte et al., 2007)</xref>
        . The theoretical
framework behind the VIT syntactic representation is
X-bar theory; thus the early version of the
treebank expresses syntactic information as phrase-structure trees.
      </p>
      <p>
        At a later time, one of the VIT authors converted
the treebank from phrase-structure to dependency
structures
        <xref ref-type="bibr" rid="ref9">(Delmonte, 2009)</xref>
        , but this version was not
distributed. It was the starting point
for the conversion described in this paper.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3 The Merged Italian Dependency Treebank</title>
      <p>
        The Merged Italian Dependency Treebank was
created as a first attempt to merge two existing
Italian resources, namely the TUT and a special
version of the ISST treebank named ISST-TANL
        <xref ref-type="bibr" rid="ref4">(Bosco et al., 2012)</xref>
        and represents the starting
point for all subsequent attempts to convert and
harmonise this resource to different standards, first
the Stanford Dependencies<sup>1</sup> and finally the Universal
Dependencies<sup>2</sup>.
      </p>
    </sec>
    <sec id="sec-5">
      <title>4 VIT Conversion</title>
      <p>
        The main part of the VIT conversion process
was completely automatic. Using the Semgrex
package<sup>3</sup>
        <xref ref-type="bibr" rid="ref6">(Chambers et al., 2007)</xref>
        from the
Stanford NLP group, we set up a set of procedures that,
starting from the definition of conversion rules,
automatically converted VIT into the MIDT
format. This procedure was developed
specifically for our conversion problem but can be used,
in principle, to convert any dependency treebank
represented in the CoNLL format into a different
format that does not require re-tokenisation steps.
      </p>
      <p><sup>1</sup> http://nlp.stanford.edu/software/stanforddependencies.shtml</p>
      <p><sup>2</sup> http://universaldependencies.org/</p>
      <p><sup>3</sup> http://nlp.stanford.edu/software/tregex.shtml</p>
      <sec id="sec-5-1">
        <title>4.1 The Semgrex language</title>
        <p>Semgrex represents nodes in a dependency
graph as a (non-recursive) attribute-value
matrix. It then uses regular expressions for
subsets of attribute values. For example,
‘{word:amo;tag:/N.*/}’ refers to any node
that has the value ‘amo’ for the attribute ‘word’ and
a ‘tag’ starting with ‘N’, while ‘{}’ refers to any
node in the graph. The most important feature of
Semgrex is that it allows relations
between nodes or groups of nodes to be specified. For example,
‘{}=1 &lt;subj {}=2’ finds all the pairs of nodes
connected by a directed ‘subj’ relation. Logical
connectives can be used to form more complex
patterns, and node naming (the ‘=’ assignments)
helps to retrieve matched nodes from the patterns.</p>
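        <p>The matching of a single node pattern can be illustrated with a short Python sketch. It is not the Semgrex implementation: the functions parse_node_pattern and node_matches are hypothetical names, nodes are assumed to be plain dictionaries, and a regular expression is assumed to have to match the whole attribute value.</p>
        <preformat><![CDATA[
```python
import re


def parse_node_pattern(pattern):
    """Turn a pattern such as '{word:amo;tag:/N.*/}' into a constraint dict.

    Values written between slashes are compiled as regular expressions;
    all other values are kept as literal strings.
    """
    body = pattern.strip()
    if not (body.startswith("{") and body.endswith("}")):
        raise ValueError("a node pattern must be enclosed in braces")
    constraints = {}
    # An empty body ('{}') yields no constraints and matches any node.
    for part in filter(None, body[1:-1].split(";")):
        attr, value = part.split(":", 1)
        if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
            constraints[attr] = re.compile(value[1:-1])
        else:
            constraints[attr] = value
    return constraints


def node_matches(node, constraints):
    """Check a node (a dict of attribute values) against all constraints."""
    for attr, expected in constraints.items():
        actual = node.get(attr)
        if actual is None:
            return False
        if isinstance(expected, re.Pattern):
            # Assumption: the regular expression must cover the whole value.
            if expected.fullmatch(actual) is None:
                return False
        elif actual != expected:
            return False
    return True
```
]]></preformat>
        <p>Under these assumptions, the pattern ‘{word:amo;tag:/N.*/}’ accepts a node with word ‘amo’ and tag ‘NN’ and rejects the same word tagged ‘VB’, while ‘{}’ accepts every node.</p>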
        <p>Unfortunately, Semgrex is simply a query
language and, in its original form, cannot be
used to rewrite dependency (sub)graphs. To
extend its capabilities, we
modified the original application to manage pairs
of patterns: the first is used to search the
treebank for the required subgraphs, and the second
specifies how the retrieved subgraphs
have to be rewritten. For example, the pattern pair
‘{tag:det}=1 &gt;arg {tag:noun}=2 --&gt;
{tag:ART}=1 &lt;DET {tag:NN}=2’, which we
call a ‘Semgrex rule’, changes the direction of
the dependency and, at the same time, changes
the word tags and the relation label. The starting and
final patterns have to contain the same number
of nodes and dependency edges. Node naming
is the key device behind this
extension, since it allows nodes to be matched between
the two patterns.</p>
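        <p>How a pattern pair acts on a graph can be sketched as follows. This is a deliberately reduced illustration, assuming rules with exactly two named nodes and one edge; apply_rule, edge_for and the tuple encoding of rules and edges are hypothetical and are not part of Semgrex or of the extended application. Following Semgrex, ‘A &gt;rel B’ is read as ‘A governs B’ and ‘A &lt;rel B’ as ‘A depends on B’.</p>
        <preformat><![CDATA[
```python
def edge_for(op, node1, node2, label):
    """Build a (dependent, head, label) edge from one side of a rule.

    '>' means node1 governs node2; '<' means node1 depends on node2.
    """
    return (node2, node1, label) if op == ">" else (node1, node2, label)


def apply_rule(tags, edges, search, rewrite):
    """Apply a two-node rule to one graph and return new (tags, edges).

    tags:    dict mapping node id to tag
    edges:   set of (dependent, head, label) tuples
    search:  (tag of node 1, operator, label, tag of node 2)
    rewrite: same layout, describing the replacement subgraph
    """
    s_tag1, s_op, s_label, s_tag2 = search
    r_tag1, r_op, r_label, r_tag2 = rewrite
    new_tags, new_edges = dict(tags), set(edges)
    # Matching reads only the original graph; changes go to the copies.
    for dep, head, label in edges:
        if label != s_label:
            continue
        node1, node2 = (head, dep) if s_op == ">" else (dep, head)
        if tags[node1] == s_tag1 and tags[node2] == s_tag2:
            new_edges.discard((dep, head, label))
            new_edges.add(edge_for(r_op, node1, node2, r_label))
            new_tags[node1], new_tags[node2] = r_tag1, r_tag2
    return new_tags, new_edges


# The rule quoted in the text, in the hypothetical tuple encoding.
SEARCH = ("det", ">", "arg", "noun")
REWRITE = ("ART", "<", "DET", "NN")
```
]]></preformat>
        <p>Because node 1 and node 2 keep their identity across the two patterns, the sketch can reverse the edge and rename both tags in a single step, which is the role node naming plays in the rules described above.</p>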
      </sec>
      <sec id="sec-5-2">
        <title>4.2 Conversion Procedure</title>
        <p>To convert VIT into the MIDT format, we
manually defined about 4,050 Semgrex rules, each
capturing a specific syntactic configuration in VIT and
transforming it into the MIDT schema, and about
150 rules for residual tag rewriting. Writing the
entire set of rules took about six months.</p>
        <p>We defined a set of new rewriting
operations on a generic dependency treebank:</p>
        <list list-type="bullet">
          <list-item><p>DEL_REL(graphID, depID, headID): deletes
a dependency edge between two graph nodes;</p></list-item>
          <list-item><p>INS_REL(graphID, depID, headID, label):
inserts a new labelled dependency edge
between two graph nodes;</p></list-item>
          <list-item><p>REN_TAG(graphID, nodeID, tag): replaces
the tag of a specific graph node.</p></list-item>
        </list>
        <p>The conversion task has been implemented as a
three-step process:</p>
        <list list-type="order">
          <list-item><p>each Semgrex rule is always
applied to the original treebank, producing a set
of matching subgraphs that have to be
rewritten;</p></list-item>
          <list-item><p>for each match, a set of specific operations for
rewriting the corresponding subgraph
is generated and stored;</p></list-item>
          <list-item><p>finally, the whole set of operations produced
by processing the entire set of Semgrex rules,
each applied to the original treebank, is
sorted by graphID, duplicates are removed
and every operation is applied graph by graph
in the following order: first
dependency deletions, then dependency
insertions and lastly tag renamings.</p></list-item>
        </list>
        <p>Processing the original treebank in this way
should guarantee that no rule
interference occurs during the conversion, because the
generation of the rewriting operations by the
Semgrex rules is decoupled from the actual
treebank rewriting: every rule sees only the original
graphs, never the output of another rule.</p>
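        <p>The decoupling described above can be sketched in Python as follows. The sketch only illustrates the collect-then-apply idea; the Op record, the convert function, the dictionary-based graph encoding and the example rule det_rule are assumptions made for the example, not the software used for the conversion.</p>
        <preformat><![CDATA[
```python
from typing import NamedTuple


class Op(NamedTuple):
    graph_id: int
    kind: str     # "DEL_REL", "INS_REL" or "REN_TAG"
    args: tuple   # (dep, head), (dep, head, label) or (node, tag)


# Application order inside each graph: deletions, insertions, renamings.
PHASE = {"DEL_REL": 0, "INS_REL": 1, "REN_TAG": 2}


def convert(treebank, rules):
    """treebank: {graph_id: {"tags": {node: tag}, "edges": {(dep, head, label)}}}."""
    # Steps 1 and 2: every rule reads the ORIGINAL treebank and only records
    # operations; storing them in a set removes duplicates.
    ops = set()
    for rule in rules:
        for graph_id, graph in treebank.items():
            ops.update(rule(graph_id, graph))
    # Step 3: apply the stored operations to a copy, graph by graph.
    converted = {
        gid: {"tags": dict(g["tags"]), "edges": set(g["edges"])}
        for gid, g in treebank.items()
    }
    for op in sorted(ops, key=lambda o: (o.graph_id, PHASE[o.kind], o.args)):
        graph = converted[op.graph_id]
        if op.kind == "DEL_REL":
            dep, head = op.args
            graph["edges"] = {e for e in graph["edges"] if (e[0], e[1]) != (dep, head)}
        elif op.kind == "INS_REL":
            graph["edges"].add(op.args)
        else:
            node, tag = op.args
            graph["tags"][node] = tag
    return converted


def det_rule(graph_id, graph):
    """Hypothetical rule: a 'det' node governing a 'noun' via 'arg' is reversed."""
    for dep, head, label in graph["edges"]:
        if label == "arg" and graph["tags"][head] == "det" and graph["tags"][dep] == "noun":
            yield Op(graph_id, "DEL_REL", (dep, head))
            yield Op(graph_id, "INS_REL", (head, dep, "DET"))
            yield Op(graph_id, "REN_TAG", (head, "ART"))
            yield Op(graph_id, "REN_TAG", (dep, "NN"))
```
]]></preformat>
        <p>In this sketch the rules never see a partially rewritten graph, so the order in which they are listed does not affect the result, and listing the same rule twice produces no additional operations.</p>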
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 Some Linguistic Issues</title>
      <p>The set of rules manually written for converting
VIT dependency structures can be subdivided into
two macro-classes: (a) rules that do not modify the
structures and (b) rules that need to modify the
dependencies, in terms of both edge direction and
the way the involved nodes are connected to each
other.</p>
      <p>The rules that do not modify the
dependency structures simply rename the
dependency label using a 1:1 or an N:1 look-up
table, since VIT typically uses more specific
dependency types than MIDT. Figure
1 outlines some simple examples of this kind of
conversion.</p>
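      <p>A label look-up of this kind can be sketched as a plain dictionary. The VIT labels below are taken from the text of this paper, whereas the target labels (‘MOD’, ‘SUBJ’, ‘OBJ’) and the function name relabel are placeholders invented for the illustration and should not be read as the actual MIDT label set or mapping.</p>
      <preformat><![CDATA[
```python
# Hypothetical N:1 look-up table: several specific source labels
# collapse onto one (invented) target label.
LABEL_MAP = {
    "adj": "MOD",
    "adjt": "MOD",
    "adjm": "MOD",
    "adjv": "MOD",
    "subj": "SUBJ",
    "obj": "OBJ",
}


def relabel(edges, label_map):
    """Rename edge labels without touching direction or attachment.

    edges is a list of (dependent, head, label) tuples. Edges whose label
    is not in the table are returned separately so they can be handled
    by other rules or by hand.
    """
    renamed, unmapped = [], []
    for dep, head, label in edges:
        if label in label_map:
            renamed.append((dep, head, label_map[label]))
        else:
            unmapped.append((dep, head, label))
    return renamed, unmapped
```
]]></preformat>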
      <p>There are, of course, other kinds of operations
on subgraphs that also require the rewriting of the
dependency structure. A good example concerns
relative clauses, in which the role of the relative
pronoun and, as a consequence, the connections
of the edge expressing the noun modification are
completely different in the two formalisms. Figure
2 shows one example of this kind of rewriting.</p>
      <p>Cases of coordination presented several
problems: in VIT the head of the coordinated
structure is linked to the connective and then the two
(or possibly more) coordinated structures can be
linked with a wide range of different dependency
types (e.g. between phrases - sn, sa, savv, sq,
sp, predicative complements - acomp, ncomp,
adjuncts - adj, adjt, adjm, adjv, subjects - subj,
objects - obj, etc.) leading to a large number of
different combinations. Moreover, each
dependency combination has to be further specified by
the different token tags. MIDT represents
coordinate structures in a different way: the connective
and the second conjunct are both linked to the first
conjunct that is connected to the head of the
coordinated structure.</p>
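      <p>The structural change for coordination can be sketched as follows, under several assumptions that go beyond the text: the conjuncts are taken to be the dependents of the connective, their linear order is taken to follow their token identifiers, the first conjunct keeps its original label when attached to the head, and the labels ‘CON’ and ‘CONJ’ are placeholders. The function name restructure_coordination is hypothetical.</p>
      <preformat><![CDATA[
```python
def restructure_coordination(edges, connective):
    """Re-attach a connective-headed coordination to its first conjunct.

    edges is a set of (dependent, head, label) tuples in which the
    connective depends on the head of the coordinated structure and the
    conjuncts depend on the connective. The connective is assumed to
    have a head and at least one conjunct.
    """
    head_edge = next(e for e in edges if e[0] == connective)
    # Assumption: sorting by token id gives the linear order of conjuncts.
    conjunct_edges = sorted(e for e in edges if e[1] == connective)
    first, _, first_label = conjunct_edges[0]

    new_edges = {e for e in edges if e != head_edge and e not in conjunct_edges}
    # The first conjunct takes over the link to the head.
    new_edges.add((first, head_edge[1], first_label))
    # The connective and the remaining conjuncts depend on the first conjunct.
    new_edges.add((connective, first, "CON"))
    for dep, _, _ in conjunct_edges[1:]:
        new_edges.add((dep, first, "CONJ"))
    return new_edges
```
]]></preformat>
      <p>For a head at position 1, conjuncts at positions 2 and 4 and a connective at position 3, the input edges (3, 1, ‘coord’), (2, 3, ‘obj’) and (4, 3, ‘obj’), where ‘coord’ is again an invented label, become (2, 1, ‘obj’), (3, 2, ‘CON’) and (4, 2, ‘CONJ’).</p>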
      <p>Figure 3 shows one example: the first formal
rule is an abstract rule pattern that has to
be filled with all the real tag combinations found
in VIT, generating a huge number of different
rules, one of which is shown as the second,
fully specified rule. This process generated more
than 2,800 different rules for handling all the
coordinated structures in VIT.</p>
      <p>A third kind of rule is also needed
for rewriting single PoS tags that might have
remained unchanged during the main conversion
process.</p>
      <p>One further point deserves some discussion. In
VIT, articulated prepositions are represented as
two different tokens, both linked to a common
head: the preposition is tagged part/partd/partda
and usually connected to the head by some kind
of modification relation, while the article is always
tagged art and linked to the head by a det
relation. In MIDT articulated prepositions are
represented by a single token. As noted above,
our process does not allow re-tokenisation rules.
Given that MIDT is only an intermediate format
and the goal is to convert VIT into the UD standard,
which requires two tokens for this phenomenon, we
decided to avoid any re-tokenisation and to convert
such structures by linking the preposition to the head
and the article to the preposition, introducing a
new dummy relation label, ‘REL EA’.</p>
    </sec>
    <sec id="sec-7">
      <title>6 Evaluation</title>
      <p>Applying all the 4,250 Semgrex rules, we obtained
a converted treebank in which 228,534 out of
280,641 dependency relations were automatically
converted, giving a global coverage of 81.4%.</p>
      <p>To test the effectiveness of the conversion
procedure and the conversion rules, we randomly
selected 100 sentences (2,582 dependency relations
to be converted) from the treebank and manually
checked every newly created dependency relation,
in terms of both the connected nodes and the
assigned label.</p>
      <p>We obtained the following results: among the
2,008 relations that were automatically
converted we found 125 wrongly converted
dependency relations. On this sample we therefore obtained a
coverage of 2,008/2,582 = 77.8%, slightly lower than
on the whole treebank, and a conversion error rate
of 125/2,008 = 6.2%.</p>
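      <p>The percentages reported in this section can be rechecked with a few lines of Python; percent is a hypothetical helper and the counts are those given in the text.</p>
      <preformat><![CDATA[
```python
def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Whole treebank: automatically converted relations over all relations.
coverage_full = percent(228_534, 280_641)

# Manually checked sample of 100 sentences.
coverage_sample = percent(2_008, 2_582)
error_rate_sample = percent(125, 2_008)

print(coverage_full, coverage_sample, error_rate_sample)
```
]]></preformat>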
    </sec>
    <sec id="sec-8">
      <title>7 Discussion and Conclusions</title>
      <p>This paper presents the procedure we developed
to convert VIT, an Italian treebank, into a
different format. Most of the described conversion
procedure relies on an automatic algorithm based on
formal rules that is able to convert
81.4% of the treebank automatically. This procedure can,
in principle, be adapted to any conversion between
different dependency treebank formats.</p>
      <p>The formal rules have been manually defined
using a well-known dependency search tool,
Semgrex from the Stanford NLP group, suitably
extended to handle rewriting rules. The final result
was manually evaluated on a sample of 100 sentences to test the effectiveness of
the written rules, yielding an error rate of 6.2%.</p>
      <p>
        To the best of our knowledge, there is no general-purpose
tool available to automate this task for
dependency graphs. Some powerful
converters can be found in the literature, but they are usually tied to
specific pairs of tagsets (often tailored to the Penn
Treebank)
        <xref ref-type="bibr" rid="ref10 ref7">(Johansson and Nugues, 2007; Choi and
Palmer, 2010)</xref>
        , and cannot be easily adapted to
general needs, or are devoted to tree manipulation,
for example the tool ‘Tregex’
        <xref ref-type="bibr" rid="ref11">(Levy and Andrew,
2006)</xref>
        .
      </p>
      <p>Even if the described procedure can convert a
large part of the treebank automatically with
few errors, the conversion still
needs a careful manual analysis to complete the
task and check the new treebank for remaining
mistakes. VIT contains many
specific and peculiar dependency subgraphs that
represent phenomena in a very detailed way. Trying
to capture all these variations in formal
rules results in a very large rule set mostly
composed of rules that handle single cases. We stopped
producing new rules when this situation
arose.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We wish to thank Rodolfo Delmonte and Maria
Simi for their valuable suggestions and
explanations concerning the analysis of linguistic phenomena and
the definition of the conversion process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Attardi</surname>
          </string-name>
          , Simone Saletti, and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank</article-title>
          .
          <source>In Proc. of 2nd Italian Conference on Computational Linguistics - CLiC-it</source>
          <year>2015</year>
          , pages
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          , Trento.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Johan</given-names>
            <surname>Bos</surname>
          </string-name>
          , Cristina Bosco, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Mazzei</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Converting a dependency treebank to a categorial grammar treebank for Italian</article-title>
          .
          <source>In Proc. of 8th International Workshop on Treebanks and Linguistic Theories - TLT8</source>
          , Milano.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Vincenzo Lombardo, Daniela Vassallo, and
          <string-name>
            <given-names>Leonardo</given-names>
            <surname>Lesmo</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Building a treebank for Italian: a data-driven annotation schema</article-title>
          .
          <source>In Proc. 2nd International Conference on Language Resources and Evaluation - LREC</source>
          <year>2000</year>
          , pages
          <fpage>99</fpage>
          -
          <lpage>105</lpage>
          , Athens.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Simonetta Montemagni, and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Harmonization and Merging of two Italian Dependency Treebanks</article-title>
          .
          <source>In Proc. of LREC</source>
          <year>2012</year>
          , Workshop on Language Resource Merging, pages
          <fpage>23</fpage>
          -
          <lpage>30</lpage>
          , Istanbul.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Cristina</given-names>
            <surname>Bosco</surname>
          </string-name>
          , Simonetta Montemagni, and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Converting Italian Treebanks: Towards an Italian Stanford Dependency Treebank</article-title>
          .
          <source>In Proc. of ACL Linguistic Annotation Workshop &amp; Interoperability with Discourse</source>
          , Sofia.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Nathanael</given-names>
            <surname>Chambers</surname>
          </string-name>
          , Daniel Cer, Trond Grenager, David Hall, Chloe Kiddon, Bill MacCartney, Marie-Catherine de Marneffe, Daniel Ramage, Eric Yeh, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Learning Alignments and Leveraging Natural Logic</article-title>
          .
          <source>In Proc. of the Workshop on Textual Entailment and Paraphrasing</source>
          , pages
          <fpage>165</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jinho</given-names>
            <surname>Choi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martha</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Robust Constituent-to-Dependency Conversion for English</article-title>
          .
          <source>In Proc. of 9th International Workshop on Treebanks and Linguistic Theories - TLT9</source>
          , Tartu, Estonia.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Rodolfo</given-names>
            <surname>Delmonte</surname>
          </string-name>
          , Antonella Bristot, and
          <string-name>
            <given-names>Sara</given-names>
            <surname>Tonelli</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>VIT - Venice Italian Treebank: Syntactic and Quantitative Features</article-title>
          .
          <source>In Proc. Sixth International Workshop on Treebanks and Linguistic Theories.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Rodolfo</given-names>
            <surname>Delmonte</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Treebanking in VIT: from Phrase Structure to Dependency Representation</article-title>
          . In Sergei Nirenburg, editor,
          <source>Language Engineering for Lesser-Studied Languages</source>
          , pages
          <fpage>51</fpage>
          -
          <lpage>81</lpage>
          . IOS Press, Amsterdam, The Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Richard</given-names>
            <surname>Johansson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Nugues</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Extended Constituent-to-dependency Conversion for English</article-title>
          .
          <source>In Proc. of NODALIDA</source>
          <year>2007</year>
          , Tartu, Estonia.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Roger</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Galen</given-names>
            <surname>Andrew</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Tregex and Tsurgeon: tools for querying and manipulating tree data structures</article-title>
          .
          <source>In Proc. of 5th International Conference on Language Resources and Evaluation - LREC</source>
          <year>2006</year>
          , Genoa, Italy.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          and
          <string-name>
            <given-names>Maria</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The Italian dependency annotated corpus developed for the CoNLL-2007 shared task</article-title>
          .
          <source>Tech. report, ILC-CNR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Joakim</given-names>
            <surname>Nivre</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Towards a Universal Grammar for Natural Language Processing</article-title>
          .
          <source>In Proc. of 16th International Conference Computational Linguistics and Intelligent Text Processing - CICLing</source>
          <year>2015</year>
          , pages
          <fpage>3</fpage>
          -
          <lpage>16</lpage>
          , Cairo, Egypt.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>