    Spotting and Improving Modularity in Large
           Scale Grammar Development

                                    Simon Petitjean

                             LIFO, University of Orleans



       Abstract. XMG (eXtensible MetaGrammar) is a metagrammar compiler
       which has already been used for the design of large scale Tree Adjoining
       Grammars and Interaction Grammars. Due to the heterogeneity in this
       field (different grammar formalisms, different languages, etc.), a
       particularly interesting aspect to explore is modularity. In this paper, we
       discuss the different spots where this modularity can be considered in
       grammar development, and its integration into XMG.



1     Introduction
Nowadays, a lot of applications have to deal with languages and consequently need to
manipulate their descriptions. Linguists are also interested in this kind of resource,
for study or comparison. For these purposes, the production of formal grammars has
become a necessity. Our work focuses on large scale grammars, that is to say grammars
which cover a significant part of the language.
    The main issue with these resources is their size (thousands of structures), which
makes their production and maintenance really complex and time-consuming tasks.
Moreover, these resources have specificities (language, grammatical framework) that
make each one unique.
    Since writing thousands of structures by hand represents a huge amount of work,
part of the process has to be automated. A fully automatic solution could consist
in acquisition from treebanks, which is a widely used technique. Semi-automatic
approaches are alternatives that give an important role to the linguist: they consist
in automatically building the whole grammar from information about its structure. The
approach we chose is based on a description language, called metagrammar [1]. The idea
behind metagrammars is to capture linguistic generalizations, and to use abstractions
to describe the grammar.
    The context that initially inspired metagrammars was that of Tree Adjoining
Grammars (TAG) [8]. This formalism is based on tree rewriting, with two specific
rewriting operations: adjunction and substitution. An adjunction is the replacement
of an internal node by an auxiliary tree (one of whose leaf nodes is labelled with ⋆ and
called the foot node), the root and foot node having the same syntactic category as the
internal node. A substitution is the replacement of a leaf node (marked with ↓) by a
tree whose root has the same syntactic category as this leaf node. The principle is
to apply these operations to a set of elementary trees to match the sentence we want
to parse. TAG is said to have an extended domain of locality, because these operations
(especially adjunction) and the depth of the trees make it possible to represent
long-distance relations between nodes: two nodes of the same elementary tree can after derivation
end up at an arbitrary distance from each other. Here, we will only manipulate LTAG
(lexicalized-TAG), which means each elementary tree is associated with at least one
lexical element.
    What can we do to reduce the amount of work involved in designing the
grammar? Let us take a look at some rules:


[Figure: two elementary trees, one for "Sally sings a song" (canonical subject and object) and one for "the song that Sally sings" (extracted object); in both trees the subject node N↓ precedes the verb anchor V⋄ under S.]

           Fig. 1. Verb with canonical subject and canonical or extracted object


    These two trees have some points in common: part of the structure is the same (the
subject is placed before the verb in both trees), and the agreement constraints,
given in feature structures associated with the nodes (not represented here), are similar. This
kind of redundancy is one of the key motivations for the use of abstractions: descriptions
of the redundant fragments that we can reuse everywhere they are needed.
    Metagrammars are based on the manipulation of such linguistic generalizations:
the whole grammar is generated from an abstract description, making it possible
to reason about language at an abstract level. The metagrammatical language we
deal with here is XMG (eXtensible MetaGrammar)¹, introduced in [4]. A new project,
XMG-2², started in 2010 to achieve the initial goal of the compiler, extensibility,
which had not been realized yet: XMG-1 only supports tree-based grammars (two
formalisms, Tree Adjoining Grammars and Interaction Grammars), and includes two
levels of description, a syntactic one and a semantic one. Our goal is to reach
two levels of modularity: we want it to be possible to assemble a grammar in a modular
way, thanks to a metagrammar itself assembled in a modular way.
    We will begin by pointing out modularity on the grammar side in section 2. In
section 3, we will focus on a new level of modularity, the metagrammatical one. In section
4, we will give an overview of what has been done and what remains to be done. Finally,
we will conclude and give some perspectives.


2        Assembling grammars in a modular way
XMG consists in defining fragments of the grammar, and controlling how these frag-
ments can combine to produce the whole grammar. The following figure shows the
intuition of the combination of fragments to produce a tree for transitive verbs. It is
done by combining three tree fragments, one for the subject (in its canonical form, that
    ¹ https://sourcesup.cru.fr/xmg/
    ² https://launchpad.net/xmg



[Figure: the tree for Transitive (with a relativized object introduced by "which") obtained as the combination of the three fragments CanSubj, Active and RelObj.]

    To build a lexicon, the metagrammar is first executed in a non-deterministic way to
produce descriptions. These descriptions are then solved to produce the models which
will be added to the lexicon.


2.1    The control language and the dimension system
The main particularity of XMG is that it allows the metagrammar to be seen as a logical
program, using logical operators.
   The abstractions (possibly with parameters) we manipulate are called classes. They
contain conjunctions and disjunctions of descriptions (tree fragment descriptions for
TAG), or calls to other classes. This is formalized by the following control language:

              Class   := Name[p1, ..., pn] → Content
              Content := ⟨Dim⟩{Desc} | Name[...] | Content ∨ Content | Content ∧ Content

For example, we can produce the two trees of figure 1 by defining the tree fragments
for canonical subject, verbal morphology, canonical object and relativized object, and
the following combinations:

                             Object → CanObj ∨ RelObj
                        Transitive → CanSubj ∧ Active ∧ Object

This part of the metagrammar says that an object is either a canonical object or a
relativized object, and that the transitive construction is created by putting together a canonical
subject, an active form and one of the two object realizations.
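    To make this combination mechanism concrete, here is a minimal sketch (plain Python rather than XMG syntax; the encoding of class contents as nested tuples and the helper expand are ours) that enumerates the alternative fragment sets denoted by such conjunctions and disjunctions:

    # A class content is a nested tuple: ("frag", name) for a single tree
    # fragment, ("or", ...) for a disjunction, ("and", ...) for a conjunction.
    def expand(content):
        """Return the list of fragment sets described by a class content."""
        op, *args = content
        if op == "frag":                 # a single tree fragment
            return [{args[0]}]
        if op == "or":                   # disjunction: one alternative per branch
            return [alt for branch in args for alt in expand(branch)]
        if op == "and":                  # conjunction: combine the alternatives
            alts = [set()]
            for branch in args:
                alts = [a | b for a in alts for b in expand(branch)]
            return alts
        raise ValueError(op)

    Object = ("or", ("frag", "CanObj"), ("frag", "RelObj"))
    Transitive = ("and", ("frag", "CanSubj"), ("frag", "Active"), Object)
    print(expand(Transitive))
    # two alternatives: {CanSubj, Active, CanObj} and {CanSubj, Active, RelObj}

Each of the two alternatives corresponds to one accumulated description, matching the non-deterministic execution mentioned above.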
    Notice that descriptions are accumulated within dimensions, which make it possible to sepa-
rate types of data. Sharing is still possible between dimensions, by means of another
dimension we call the interface. In XMG's TAG compiler for example, the syn dimension
accumulates tree descriptions while the sem dimension accumulates predicates repre-
senting the semantics. Each dimension comes with a description language adapted to
the type of data it will contain: for each type of description we need to accumulate,
a different description language is used. The first version of XMG provides a
tree description language (for TAG or Interaction Grammars) associated with the syn
dimension and a language for semantics associated with the sem dimension.

A tree description language
For trees in TAG, we use the following tree description language:

    Desc := x → y | x →+ y | x →∗ y | x ≺ y | x ≺+ y | x ≺∗ y | x[f :E]
               | x(p:E) | Desc ∧ Desc

where x and y are node variables, → and ≺ denote dominance and precedence between nodes
(+ and ∗ respectively standing for their transitive and reflexive-transitive closures), and ':' is the
association between a property p or a feature f and an expression E. Properties are
constraints specific to the formalism (the fact that a node is a substitution node for
example), while features contain linguistic information, such as syntactic categories,
number or gender.
    When accumulated, the tree description in the syntactic dimension is still partial.
The TAG elementary trees that compose the grammar are the models for this partial
description. They are built by a tree description solver, based on constraints to ensure
the well-formedness of the solutions. XMG computes minimal models, that is to say
models where only the nodes of the description exist (no additional node is created).
Here is a toy metagrammar, composed of three description classes (representing canon-
ical subject, relative object, active form) and one combination class (transitive mode):

    CanSubj → ⟨syn⟩{(s1[cat:S] → v1[cat:V]) ∧ (s1 → n1(mark:subst)[cat:N]) ∧ (n1 ≺ v1)}
     RelObj → ⟨syn⟩{(n2[cat:N] → n3(mark:adj)[cat:N]) ∧ (n2 → s2[cat:S]) ∧ (n3 ≺ s2)
                    ∧ (s2 → c) ∧ (s2 → s1[cat:S]) ∧ (c ≺ s1)
                    ∧ (c → wh[cat:wh]) ∧ (s1 → n1[cat:N])}
     Active → ⟨syn⟩{(s1 → v2[cat:V])}
 Transitive → CanSubj ∧ RelObj ∧ Active

    The minimal models for the classes named CanSubj, Active and RelObj are the
trees with matching names in the previous figure. The tree Transitive is a minimal
model of the description accumulated in the class Transitive.
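    To give an idea of what the solver manipulates, here is a sketch (Python; the tuple encoding of constraints and the satisfies helper are ours, and only immediate dominance and precedence are handled, not the transitive closures) of the CanSubj description as plain data, checked against a candidate model:

    # Hypothetical encoding of constraints:
    #   ("dom", x, y)       x immediately dominates y
    #   ("prec", x, y)      x precedes y among siblings
    #   ("feat", x, f, v)   feature f of node x has value v
    #   ("prop", x, p, v)   property p of node x has value v
    can_subj = [
        ("dom", "s1", "v1"), ("dom", "s1", "n1"), ("prec", "n1", "v1"),
        ("feat", "s1", "cat", "S"), ("feat", "v1", "cat", "V"),
        ("feat", "n1", "cat", "N"), ("prop", "n1", "mark", "subst"),
    ]

    def satisfies(model, constraints):
        """model: children (parent -> ordered child list), feats ((node, key) -> value)."""
        for c in constraints:
            if c[0] == "dom" and c[2] not in model["children"].get(c[1], []):
                return False
            if c[0] == "prec":
                sibs = next((kids for kids in model["children"].values()
                             if c[1] in kids and c[2] in kids), None)
                if sibs is None or sibs.index(c[1]) > sibs.index(c[2]):
                    return False
            if c[0] in ("feat", "prop") and model["feats"].get((c[1], c[2])) != c[3]:
                return False
        return True

    model = {"children": {"s1": ["n1", "v1"]},
             "feats": {("s1", "cat"): "S", ("v1", "cat"): "V",
                       ("n1", "cat"): "N", ("n1", "mark"): "subst"}}
    print(satisfies(model, can_subj))    # True: this model is the CanSubj tree

The real solver does more: it enumerates the models of the whole accumulated description and keeps only the minimal ones, under the well-formedness constraints of the formalism.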


A language for semantics
To describe semantics, we use another description language, which is:

            SemDesc := ℓ : p(E1 , ..., En ) | ¬ℓ : p(E1 , ..., En ) | Ei << Ej | E

where ℓ is a label for predicate p (of arity n) and << is a scope-over relation for dealing
with quantifiers. To add binary relations to the semantic dimension, we can use a class
of this type:

                     BinaryRel[Pred, X, Y] → ⟨sem⟩{Pred(X, Y)}

When instantiated with Pred = love, X = John, Y = Mary, calling the class BinaryRel
accumulates the predicate love(John, Mary).
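    As a sketch of what this accumulation amounts to (Python; the class and attribute names are ours, not XMG internals), the sem dimension can be seen as a growing list of labelled, possibly negated literals:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Literal:
        pred: str                    # predicate name, e.g. "love"
        args: tuple                  # arguments, e.g. ("John", "Mary")
        label: Optional[str] = None  # the optional label l
        negated: bool = False

    @dataclass
    class SemDimension:
        literals: list = field(default_factory=list)
        def binary_rel(self, pred, x, y):   # analogue of BinaryRel[Pred, X, Y]
            self.literals.append(Literal(pred, (x, y)))

    sem = SemDimension()
    sem.binary_rel("love", "John", "Mary")  # accumulates love(John, Mary)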

2.2     Principles
Some additional sets of constraints, which we call principles, are available. Their goal is to
check properties of the models resulting from the compilation; they are consequently
dependent on the target formalism. For example, in TAG, the color principle is a
way to forbid some fragment combinations, by associating a color with each node.
    When nodes are unified, their colors are merged: a red node must not unify with any other node, a white
node has to unify with a black node (creating a black node), and a black node can only
unify with white nodes. The only valid models are those in which every node is col-
ored either red or black. The following table shows the results of color unification.


                                    •b   •r   ◦w   ⊥
                               •b   ⊥    ⊥    •b   ⊥
                               •r   ⊥    ⊥    ⊥    ⊥
                               ◦w   •b   ⊥    ◦w   ⊥
                               ⊥    ⊥    ⊥    ⊥    ⊥


         Fig. 2. Unification rules for colors (•b = black, •r = red, ◦w = white, ⊥ = failure).
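    The table translates directly into code; the following sketch (Python, with an encoding of our own, not XMG's internals) implements the unification rules of figure 2 and the final validity check:

    FAIL = None  # stands for bottom (unification failure)

    def unify_colors(c1, c2):
        """Figure 2: white+white -> white, black+white -> black, anything else fails."""
        if c1 == "white" and c2 == "white":
            return "white"
        if {c1, c2} == {"black", "white"}:
            return "black"
        return FAIL            # black+black, red+anything, failure+anything

    def model_is_valid(node_colors):
        """Only models whose nodes are all red or black are kept."""
        return all(c in ("red", "black") for c in node_colors)

    print(unify_colors("black", "white"))             # black
    print(unify_colors("red", "white"))               # None: red never unifies
    print(model_is_valid(["red", "black", "black"]))  # True
    print(model_is_valid(["white", "black"]))         # False: a white node remains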


    Coming back to our previous example, the colored tree fragments of the meta-
grammar are the following:

[Figure: the colored fragments CanSubj, RelObj and Active (nodes marked •B black, •R red, ◦W white), and, on the right, the only resulting tree in which every node is red or black.]

    The tree description solver (ignoring the colors) will produce models where the
nodes labelled S of CanSubj and Active unify with either of the two nodes labelled S
in RelObj, where the nodes labelled V do not unify, etc. But when filtering with the
color principle, the only remaining model is the one on the right, which, contrary to the
others, is linguistically valid.
    We can also cite the rank principle: we use it to add constraints on the ordering
of nodes in the models of the description. In French, for example, clitics are necessarily
ordered, so we associate a rank property with some nodes, with values that will enforce the
right order.


3     Assembling metagrammars in a modular way
The main aim of the XMG-2 project is to make it possible for the linguist to design new
metagrammatical scopes that can accommodate any linguistic theory. A simple way to
realize this ambition is to provide a set of bricks the user can pick from to build the compiler
he needs. These bricks could be used to design new dimensions, with new description
languages or new principles.

3.1      A modular architecture
The XMG compiler comes with a modular processing chain. Most of this chain is a standard
compilation chain, including a tokenizer for the metagrammar, a parser, an unfolder, etc.
    The particularity of XMG is to make it possible to choose the modules that best suit a given
metagrammar. In this way, descriptions accumulated in different dimensions
can be handled differently. For example, the end of the processing chain for TAG is a
tree description solver, which builds the grammar's elementary trees from the descrip-
tions accumulated in the syntactic dimension. The user can also choose the kind of output
the compiler will produce: he can interactively observe the grammar he produced, or
produce an XML description of the grammar. This description can then be used by a parser
(for example TuLiPA [9]³ for TAG, or LeoPar⁴ for IG).
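    The chain itself can be pictured as a list of interchangeable stages; the sketch below (Python; the stage names in the comments are illustrative, not the actual module names of XMG) only shows the plumbing:

    # Each stage maps one intermediate representation to the next
    # (metagrammar text -> tokens -> AST -> descriptions -> output).
    def run_chain(source, stages):
        data = source
        for stage in stages:
            data = stage(data)
        return data

    # For a TAG-style syn dimension the chain could end with a tree description
    # solver and an XML export; another dimension can plug in different stages:
    # tag_chain = [tokenize, parse, unfold, solve_tree_descriptions, export_xml]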

3.2      Representation modules
As we wish to build a tool which is as universal as possible, being independent from the
formalism is a priority. To achieve this goal, we need to be able to describe any type of
structure in XMG. We saw that the dimension system was useful to separate syntax from
semantics. It could also be used to separate tree descriptions from constraint-based
descriptions, as long as each has a dedicated dimension with a dedicated description
language.
    In [6], description languages for two formalisms, namely Lexical Functional Gram-
mar (LFG) and Property Grammars (PG), are proposed. Here, we will focus on Prop-
erty Grammars, because they differ from TAG in many aspects. PG are not based on
tree rewriting but on a system of local constraints: the properties. A property concerns a
node and constrains its children. One of the interesting aspects of
PG is the ability to analyse ungrammatical utterances: when parsing an utterance,
its grammaticality score is lowered for every violated property. Here, we will consider
these six properties:
        Obligation     A : △B      at least one B child
        Uniqueness     A : B!      at most one B child
        Linearity      A : B ≺ C   a B child precedes a C child
        Requirement    A : B ⇒ C   if there is a B child, there is also a C child
        Exclusion      A : B ⇎ C   B and C children are mutually exclusive
        Constituency   A : S       the children's categories must belong to S
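    As an illustration of how these properties can be evaluated (Python; the encoding and the violations helper are ours), the following sketch checks the children of a constituent against a set of properties and collects the violated ones, in the spirit of the grammaticality scoring mentioned above:

    def violations(children, props):
        """children: the ordered category labels of an A node's children."""
        bad = []
        for b in props.get("obligation", []):        # at least one B child
            if children.count(b) < 1: bad.append(("obligation", b))
        for b in props.get("uniqueness", []):        # at most one B child
            if children.count(b) > 1: bad.append(("uniqueness", b))
        for b, c in props.get("linearity", []):      # every B child before every C child
            if b in children and c in children and \
               max(i for i, x in enumerate(children) if x == b) > \
               min(i for i, x in enumerate(children) if x == c):
                bad.append(("linearity", b, c))
        for b, c in props.get("requirement", []):    # a B child implies a C child
            if b in children and c not in children: bad.append(("requirement", b, c))
        for b, c in props.get("exclusion", []):      # B and C children are exclusive
            if b in children and c in children: bad.append(("exclusion", b, c))
        allowed = props.get("constituency")          # allowed categories under A
        if allowed is not None:
            bad += [("constituency", x) for x in children if x not in allowed]
        return bad

    # "ne dort pas": the negation adverbs are unique and Adv-ng precedes the verb
    print(violations(["Adv-ng", "V", "Adv-np"],
                     {"uniqueness": ["Adv-ng", "Adv-np"],
                      "linearity": [("Adv-ng", "V")]}))    # []: no violation

A parser would apply such checks to candidate constituents and lower the grammaticality score for each violated property.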
A real-size PG consists in an inheritance hierarchy of linguistic constructions. These
constructions are composed of feature structures and a set of properties. Variables are
manipulated on both sides, and can be used to share data between them. Figure 3
represents a part of the hierarchy built in [7] for French. The V-n construction of the
figure says that, for verbs with negation in French, negation implies the presence of
an adverb labelled with category Adv-ng (ne) and/or an adverb labelled with
category Adv-np (like pas). We also have a uniqueness constraint over these adverbs,
and a linear order must be respected (ne must come before pas). When the mode of
the verb is infinitive, the verb must be placed after the adverbs.

    ³ https://sourcesup.cru.fr/tulipa/
    ⁴ http://wikilligramme.loria.fr/doku.php?id=leopar:leopar


[Figure: three constructions of the hierarchy: V (Verb); V-n (Verb with negation), which inherits V and carries uniqueness, requirement and linearity properties over the adverbs Adv-ng and Adv-np; and V-m (Verb with modality), which inherits V and V-n and carries uniqueness, requirement and linearity properties over Prep.]

              Fig. 3. Fragment of a PG for French (basic verbal constructions)



    To describe a PG, we need to be able to represent encapsulations, variables, feature
structures, and properties. We can notice that XMG classes can be seen as encapsu-
lations, and that variables and feature structures were already used for TAG descrip-
tions. Given that, the XMG description language for PG can be formalized this
way:

  DescPG := x = y | x ≠ y | [f : E] | {P} | DescPG ∧ DescPG
       P := A : △B | A : B! | A : B ≺ C | A : B ⇒ C | A : B ⇎ C | A : S

where x, y correspond to unification variables, = to unification, ≠ to unification failure,
: to the association between the feature f and some (possibly complex) expression E, and
{P} to a set of properties. Note that E and P may share unification variables.
The translation of the linguistic construction for V-m in XMG would be:

   V-m → (V ∨ V-n) ∧ ⟨PG⟩{[INTR:[SYN:[INTRO:[RECT:X, DEP:Prep]]]]
                           ∧ (V : Prep!) ∧ (V : X ⇒ Prep) ∧ (V : X ≺ Prep)}

    Here, inheritance is made possible by class calls. The control language even
allows disjunctive inheritance, as happens in class V-m. The end of the
compilation process for PG differs from TAG's: no solver is needed for the
descriptions, since the accumulation in the PG dimension is the grammar. To get the properties
solved for a given sentence, the solution is to use a parser as a post-processor for the
compiler.
    Nevertheless, including a specific representation module in the compiler can be seen
as an ad hoc solution. That is why allowing the linguist to build his own description
language (for example, choosing to use feature structures, dominance relations between
nodes, open unification, etc.) would be an essential feature.

3.3     Principle bricks
The notion of principle defined in XMG was too restrictive for our aims. Its speci-
ficity to the target formalism, for example, is incompatible with the multi-formalism
ambition. An interesting way to handle principles is the one of [3], which allows the
linguist both to create his own principles and to use a subset of the ones already defined.
An example is the tree principle, which states that the solution models must be trees.
What we aim to provide is a library of meta-principles: generic and parametrizable prin-
ciples the user can pick and configure. For example, the color principle provided for
TAG could be an implementation of a generic polarity principle, parametrized with the
table of figure 2. Another example of meta-principle, called unicity, was already
implemented in XMG-1. It is used to check the uniqueness of a specific attribute-value
pair in each solution, and is thus not specific to any linguistic theory.
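    To illustrate what a meta-principle could look like, here is a sketch (Python; the interface is hypothetical, not the planned XMG-2 API) of a generic polarity principle parametrized by a unification table, of which the color principle of section 2.2 would be one instance:

    def polarity_principle(table, accepted):
        """table: maps a frozenset of two polarities to their unification result
        (a missing entry means failure); accepted: polarities allowed in a final model."""
        def unify(p1, p2):
            return table.get(frozenset((p1, p2)))
        def valid_model(node_polarities):
            return all(p in accepted for p in node_polarities)
        return unify, valid_model

    # Instantiated with the table of figure 2, this gives back the color principle:
    color_unify, color_valid = polarity_principle(
        table={frozenset(("white",)): "white",           # white + white -> white
               frozenset(("black", "white")): "black"},  # black + white -> black
        accepted={"red", "black"})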


3.4    Dynamic definition of a metagrammar
To build his own metagrammatical scope, one should only have to select the dimen-
sions he needs and the properties he wants to check on them. Building a dimension
would consist in picking bricks from a library to create a new description language.
With this feature, a user could redefine the Property Grammar description we proposed
earlier. The advantage here is that the formalism-specific part of the compiler is generated automat-
ically, and new features could be added just for experiments. Defining the principles
would simply consist in taking meta-principles from the library and instantiating them.
    Building a metagrammar compiler in this way makes it possible to deal with a large range
of linguistic theories, or even to experiment quickly while creating a new grammar
formalism.


4     Current state of the work
The XMG project started in 2003 with a first tool, which has been used to produce large
TAG grammars for French [2], German [10] and English, and a large Interaction Gram-
mar for French [11]. The compiler was written in Oz/Mozart, a language which is no longer
maintained and is not compatible with today's 64-bit architectures. It was
also important to restart from scratch, in order to build a compiler more in line
with its ambitions: modularity and extensibility.
    Consequently, a new implementation started in 2010, in YAP (Yet Another Prolog)
with bindings to Gecode for constraint solving. XMG-2 is currently the tool used
for modeling the syntax and morphology of Ikota, a Bantu language [5], and is getting
close to full compatibility with the previous large metagrammars. It also includes a
dimension for basic Property Grammar descriptions. The work now focuses on a parser
generator which, from the description of a description language, produces the parser rules
for this language. The first application could be the dynamic generation of a language
dedicated to morphological descriptions. We also wish to quickly implement some generic
principles, beginning with the tree principle.


5     Conclusion
In this paper, we showed how modularity, together with a metagrammatical approach,
eases the development of a large scale grammar. This modularity is essential for reach-
ing the main goal of XMG, that is to say extensibility. Achieving it means taking a
big step towards multi-formalism and multi-language grammar development, which in turn
offers new possibilities for sharing data between different types of grammar, or even
for comparing them.
    Now, what we would like to create is a way to express the definition of dimensions
and meta-principles. This could begin with the formalization of a description language for de-
scription languages. We also aim to provide more checking tools to the user, beginning
with the type checking of the properties and feature structures manipulated in many
grammar formalisms.


References
 1. Candito, M.: A Principle-Based Hierarchical Representation of LTAGs. In: Pro-
    ceedings of COLING 96. Copenhagen, Denmark (1996)
 2. Crabbé, B.: Représentation informatique de grammaires fortement lexicalisées :
    Application à la grammaire d’arbres adjoints. Ph.D. thesis, Université Nancy 2
    (2005)
 3. Debusmann, R.: Extensible Dependency Grammar: A Modular Grammar Formal-
    ism Based On Multigraph Description. Ph.D. thesis, Saarland University (4 2006)
 4. Duchier, D., Le Roux, J., Parmentier, Y.: The Metagrammar Compiler: An NLP
    Application with a Multi-paradigm Architecture. In: Proceedings of the 2nd Oz-
    Mozart Conference, MOZ 2004. Charleroi, Belgium (2004)
 5. Duchier, D., Magnana Ekoukou, B., Parmentier, Y., Petitjean, S., Schang, E.: De-
    scribing Morphologically-rich Languages using Metagrammars: a Look at Verbs in
    Ikota. In: Workshop on ”Language technology for normalisation of less-resourced
    languages”, 8th SALTMIL Workshop on Minority Languages and the 4th work-
    shop on African Language Technology. Istanbul, Turkey (2012), http://hal.
    archives-ouvertes.fr/hal-00688643/en/
 6. Duchier, D., Parmentier, Y., Petitjean, S.: Cross-framework grammar engineering
    using constraint-driven metagrammars. In: CSLP'11. Karlsruhe, Germany (2011),
    http://hal.archives-ouvertes.fr/hal-00614661/en/
 7. Guénot, M.L.: Éléments de grammaire du français pour une théorie descriptive et
    formelle de la langue. Ph.D. thesis, Université de Provence (2006)
 8. Joshi, A.K., Schabes, Y.: Tree adjoining grammars. In: Rozenberg, G., Salomaa,
    A. (eds.) Handbook of Formal Languages. Springer Verlag, Berlin (1997)
 9. Kallmeyer, L., Maier, W., Parmentier, Y., Dellert, J.: TuLiPA: Parsing Extensions
    of TAG with Range Concatenation Grammars. In: First Polish-German Workshop
    on Research Cooperation in Computer Science (June 2009)
10. Kallmeyer, L., Lichte, T., Maier, W., Parmentier, Y., Dellert, J.: Developing a
    TT-MCTAG for German with an RCG-based Parser. In: LREC. ELRA (2008)
11. Perrier, G.: A French Interaction Grammar. In: RANLP. Borovets, Bulgaria (2007),
    http://hal.inria.fr/inria-00184108/en/