Introduction

PCFG Extraction and Pre-typed Sentence Analysis

No´emie-Fleur Sandillon-Rezer nfsr@labri.fr

0 0 LaBRI, CNRS 351 cours de la Lib ́eration , 33405 Talence

150 159

We explain how we extracted a PCFG (probabilistic contextfree grammar) from the Paris VII treebank. First we transform the syntactic trees of the corpus in derivation trees. The transformation is done with a generalized tree transducer, a variation from the usual top-down tree transducers, and gives as result some derivation trees for an AB grammar, which is a subset of a Lambek grammar, containing only the left and right elimination rules. We then have to extract a PCFG from the derivation tree. For this, we assume that the derivation trees are representative of the grammar. The extracted grammar is used, through a slightly modified CYK algorithm that takes in account the probabilities, for sentences analysis. It enables us to know if a sentence is include in the language described by the grammar.

Introduction

This article describes a method to extract a PCFG from the Paris VII treebank [ 1 ]. The first step is the transformation of the syntactic trees of the treebank into derivation trees representative of an AB grammar [ 13 ], which corresponds to the elimination rules of Lambek Calculus, as shown in Fig. 1. We chose an AB grammar because we want our approach to be potentially compatible with some usual learning algorithms, like the one of Buszkowski and Penn [ 4 ] or Kanazawa [ 12 ]. Once we have the derivation trees, we extracted the PCFG from them, and use it for sentence analysis. This analysis helps us to know how we can improve our grammar and all the processing line used to get it, by analyzing why some correct sentences cannot be parsed or why some incorrect ones are still parsed. In a more long-viewed aim, parsing french sentences can be use for grammatical checking, and with semantic information over a lexicon, the grammar could be used for generate coherent sentences.

The Paris VII treebank [ 1 ] contains sentences from the newspaper Le Monde, analyzed and annotated by the Laboratoire de linguistique de Paris VII. The flat shape of trees does not allow the direct application of a usual learning algorithm, so we decided to use a generalized tree transducer. For our work, we use a subpart of the treebank, on a parenthesized form, composed by 12, 351 sentences. Even if the whole treebank was in an XML form, the parenthesized form is easier to treat with the transducer. The 504 sentences left aside will be an evaluation treebank that we use as a control group.

Another new treebank, Sequoia [ 5 ], which is composed by 3, 200 sentences coming from different horizons, will also be used for experimentation. It is annotated using the same convention as the French Treebank.

This article will firstly overfly the transducer we use to transform syntactic trees into derivation trees, then we will focus on PCFG extraction. In a third part, we will detail the experimental results, obtained by using our PCFG to find the best analysis for a sentence, via the CYK algorithm [ 19 ]. 2

Generalized Tree Transducer It exists many way to make a syntactic analysis of a treebank, as we can see with the work of Hockenmaier [ 8 ], or Klein and Manning [ 10 ], but they were not applicable over the French Treebank or they did not gave simple AB grammar.

The transducer we created is the central point of the grammar extraction process. Indeed, the binarization of syntactic trees parametrize the extracted grammar. We based our works on usual derivation rules of an AB grammar used in computational linguistic and the annotations of the treebank itself [ 2 ]. The annotations give two types of information about the trees : POS-tag: the Part-Of-Speech tags, booked for the pre-terminal nodes, indicates the POS-tag of the daughter node. For example NC will be used for Common Noun, DET for a determinant, etc.

Phrasal types: the nodes which are not a terminal or a pre-terminal node are annotated with their syntactic categories and sometime the role of the node. A NP-SUJ node will correspond to a Noun Phrase used as a subject for the sentence. For the usual derivation rules, they are instantiation of elimination rules of Lambek calculus (see Fig. 1). We based ourselves on other annotation methods : a NP node will have the type np, a sentence type will be s, a preposition phrase taken as an argument will have the type pp, and so on.

A transducer, like defined by TATA [ 7 ], is an automaton which read an input and write something on output. It can be applied to trees and transform the shape of them. The transducers have some feature especially important for our work. Indeed, they are non erasing (it ensures us that we do not lost informations during the transduction), linear (the transduction will not change the order of the words in the sentence) and ǫ-free (gives more control over the transduction). The G transducer, developed by Sandillon-Rezer in [ 17 ], has additional features that feet better with the global shape of the treebank: Recursivity: Our transducer can apply a rule to a node, looking only on its label but with an arbitrary arity. It generalize the usual definition of transduction rule, by matching node with an arbitrary number of daughters. The study of each specific case of the rule can transform the recursive rule into a set of ordinary rules. Parametrization: We allow the rules to have some generic nodes which replace a node from a finite set of nodes. This quantification is equivalent to write each instantiation of the generic node.

Priority system: As our transducer needs to be deterministic, we decided to apply the rules always in a given order. It ensures us to have only one output tree.

The transduction rules have been written from a systematic analysis of the different shapes we could find in the treebank. As an example, a syntactic tree from the treebank and its transduction is shown Fig. 2.

PP-P OBJ CLS-SUJ V

On ouvre

P sur

NPP la Cinq VN s ouvre

P pp/np sur

SENT s

NP np NPP np la Cinq

SENT PONCT

TEXT txt

s\s CC mais

COORD VN Sint PONCT

AdP-MOD CLS-SUJ V ADV on glisse assez ADV vite

PONCT s\txt

COORD s\s

mais

VN s

AdP-MOD s\s CLS-SUJ np V np\s ADV (s\s)/(s\s) ADV s\s on glisse assez vite CLS-SUJ np np\s

PONCT (s\s)/(s\s)

On V (np\s)/pp PP-P OBJ pp , CC (s\s)/s

Sint s Grammar extraction

Even if the lexicon, extracted from the derivation trees, is representative enough of an AB grammar, and gives a probabilistic distribution of different types for words, it limits the sentence analysis to the sole lexicon of the treebank. This is the reason we decided to extract a PCFG from the derivation trees.

The output trees give both syntactic (we keep the initial labels) and structural information over sentences. We decided to proceed to a preprocessing step, in order for the user to control the extracted information. The only part of information we always keep is the type of nodes. The extracted grammar is a PCFG, the probabilities are computed based on their root node. We remind that our grammar is defined by a tuple {N, T, S, R}, where N is the set of non-terminal symbols (internal nodes of trees), T the set of terminal symbols (typed leaves), S the initial symbol1 and R the set of rules.

The extraction algorithm parses the trees and stores the derivation rules it sees. A rule is composed by a root and one or two daughters, then this is the usual case of a right or left elimination rule (a → a/b b or a → b a\b). Otherwise, it is only a type transmission which appears when a noun phrase is composed by a proper name only; or at the pre-terminal node level when the POS-tag node transmits type to the leaf. The probabilities are computed on a root related group.

The table 1 summarizes the grammars potentially generated. Each one presents useful information: the first one, from the derivation trees without preprocessing step, keeps the syntactic informations given by the treebank. The others are more useful for the application of a sentence analysis algorithm, like CYK (see section 4), on non-typed sentences. The table 2 shows some extracted rules. The analysis process can be subdivided in two parts. On the one hand, we have to type the words, while staying as close as possible to standard Lambek derivations. On the other hand, the needed rules must belong to the input grammar. By gathering the leaves of the derivation trees, we have a lexicon, as we can see Fig. 3. However, a typing system based only on this lexicon reduces the possibility of parsing to the sentences composed by words from the French Treebank. We decided to type words with the Supertagger (see Moot [ 15, 16 ]), which enabled us to validate the Supertagger results and to analyze sentences which did not occur in the Paris VII treebank.

The Supertagger is trained with the lexicon extracted from the transduced derivation trees.

1S = TXT:txt or txt, depending of the preprocessing step. We decided to use the CYK [ 19 ] algorithm, already tested and considered as a reference, with a probabilistic version for the parsing of sentences. We removed the typing step done initially by CYK with the rules n1 → t1 1, replaced by the Supertagger work. We use the simplest grammar ( of 3, 494 rules ) for the analysis. The first test, to assure the correct running of the program, was to re-generate the trees from the transduced sentences with the grammar extracted from the derivation trees. Then, we tested the analysis with sentences typed by the Supertagger, using the grammar extracted from the main treebank (see results section 5.

The derivation trees corresponding to the sentence ”Celui-ci a import´e `a tout va pour les besoins de la r´eunification.” (”This one imported without restraint for the need of reunification need.”) are shown in Fig. 4. We took the two most probable trees, typed by the Supertagger and analyzed with the grammar extracted from the main treebank. Two types of information are relevant to choose the best trees: we took into account the probability and the complexity of types. However, it is known that the comparison between two trees that do not have the same shape or leaves is complex. The main difference between the two trees is the prepositional phrase attachment; the most probable tree is more representative of the original treebank. `a tout va s\s pour (s\s)/np

np s

. s\txt txt s\s les np/n

n besoins n

n\n de (n\n)/np

np la np/n r´eunification n txt s

. s\txt Celui-ci np

np\s a (np\s)/(np\sp)

np\sp (np\sp)/pp import´e (np\sp)/pp a` tout va ((np\sp)/pp)\((np\sp)/pp) pp pour pp/np

np les np/n

n besoins n

n\n de (n\n)/np

np la np/n r´eunification n

Results and evaluation G-Transducer

From now, the transducer treats at least 88% of the corpora (see Table 3 for details). The lower percentage on the evaluation treebank can be explained by the study of the remaining sentences: they are colloquial and complex. On Sequoia, the better results can be explain by the greater simplicity of sentences.

The use of rules, for the main treebank, is summarized in Table 4. We note that even if many rules are used infrequently, they do not have a real weight in the global use of the rules. Some of the important rules are shown in the Table 5 in the Tregex [ 14 ] parenthesized way. We were surprised too by the few occurrences of the rule that treats the determiner at the beginning of a noun phrase, despite the amount of NP in the treebank, but there is a rule which treats only the case of a noun phrase composed by a determinant and a noun, called 8, 892 times.

The derivation trees of the three corpora allow us to extract three different grammars, and in addition, lexicon were created, containing words and their formulas. The lexicon covers 96.6% of the words of the Paris VII treebank, i.e. 26, 765 words on 27, 589, and for Sequoia, it covers 95.1%, i.e. 18, 350 word on 19, 284.

Sentence parsing The parsing of typed sentences is done with the grammar extracted from the main treebank, which is the most covering one. Each treebank has been divided in two part, the transduced one and the non-transduced one. For the Supertagger, we used a β equal to 0.01: even if the supertagging and the parsing steps are slower, the results are much better than a β of 0.05. The results are gathered in Table 6. We note that non-transduced sentences are nevertheless analyzed, even if the results are less accurate than for the other sentences.

We also tested the precision of the Supertagger (for the whole tests, see [ 18 ]): the Supertagger can adjust the number of types given to a word, with the β parameter. It enables to select formula which the confidence of the Supertagger is at least β times the confidence of the first supertag. We summarize the time spent to analyze the fragment of 440 sentences of the evaluation corpus, the effectiveness of the algorithm by modifying β in Table 7. The high number of types is due to the limitations of AB-grammar. When the Supertagger is used for multimodal categorial grammars, the average number of formulas is around 2.4 with a β equal to 0.01 and 4.5 with β = 0.001; the correctness is better too, with respectively a rate of 98.2% and 98.8%.

Conclusion and prospects

In this article, we have briefly introduced the G-transducer principle, that we used to transform syntactic trees into derivation trees of an AB grammar. Then we explained how we extracted a PCFG and used it in the sentence analysis. The experimental results of the CYK algorithm used with typed sentences enabled us to compare the annotation from the transducer and the Supertagger.

However, the work is still ongoing, and opens many horizons. Of course, we want to extend the coverage of the transducer to exceed 95% and simplify the types of words. The main problem is that only complex cases remain, but we should be able to find derivation trees, given that we can analyze a part of them with the CYK algorithm. In order to improve parsing precision, we intend to integrate modern techniques such as those of [ 3, 20 ] into our parser. Using the Charniak method [ 6 ], we would like to transform our grammar into a highly lexicalized grammar.

Given that AB grammars may seem limiting in the case of a complex language, we wish to transform our transducer into a tree to graph transducer. This way, we would be able to use the whole Lambek calculus.

The XML version of the Treebank gives more informations on the words, like the tense of verbs. A major evolution would be to reflect this information into our transducer, even if it implies many transformation for it.

Our work and programs are available on [ 21 ], under GNU General Public Licence.

References

1. Abeill´e, A. , Cl´ement, L., Toussenel , F. : Building a treebank for French . Treebanks, Kluwer, Dordrecht ( 2003 ).

2. Abeill´e, A. , Cl´ement, L., Toussenel , F. : Annotation Morpho-syntaxique. http://llf.linguist.jussieu.fr ( 2003 ).

3. Auli

and Lopez

: Efficient CCG Parsing: A* versus Adaptive Supertagging . In Proceedings of the Association for Computational Linguistics ( 2011 ).

4. Buszkowski , W. , Penn , G.: Categorial grammars determined from linguistic data by unification . Studia Logica ( 1990 ).

5. Candito

and Seddah D. : Le corpusSequoia : annotation syntaxique et exploitation pour l'adaptation d'analyseur par pont lexical TALN'2012 proceedings , Grenoble, France ( 2012 ).

6. Charniak

A maximum-entropy-inspired parser . In Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL) , Seattle ( 2000 ).

7. Comon , H. , Dauchet , M. , Jacquemard , F. , Lugiez , D. , Tison , S. , Tommasi , M. : Tree automata techniques and applications ( 1997 ).

8. Hockenmaier

: Data and Models for Statistical Parsing with Combinatory Categorial Grammar . PhD thesis , 2003 .

9. Hopcroft

J.E.

and Ullman J.D. : Introduction to Automata Theory, Languages, and Computation . Addison-Wesley Publishing Company ( 1979 ).

10. Klein

and Manning

: Accurate Unlexicalized Parsing . In Proceedings of the Association for Computational Linguistics ( 2003 ).

11. Knuth

D.E.

: The Art of Computer Programming Volume 2 :

Seminumerical

Algorithms (3rd ed.) Addison-Wesley Professional ( 1997 ).

12. Kanazawa , M. : Learnable Classes of Categorial Grammars. Center for the Study of Language and Information ( 1998 ).

13. Lambek , J.: The Mathematics of Sentence Structure . ( 1958 ).

14. Levy

, Andrew G.: Tregex and Tsurgeon: tools for querying and manipulating tree data structures . http://nlp.stanford.edu/software/tregex.shtml ( 2006 ).

15. Moot , R.: Automated extraction of type-logical supertags from the Spoken Dutch Corpus. Complexity of Lexical Descriptions and its Relevance to Natural Language Processing: A Supertagging Approach ( 2010 ).

16. Moot , R.: Semi-automated Extraction of a Wide-Coverage Type-Logical Grammar For French . Proceedings TALN 2010 , Montreal ( 2010 ).

17. Sandillon-Rezer , N.F. : Learning categorial grammar with tree transducers . ESSLLI Student Session Proceedings ( 2011 ).

18. Sandillon-Rezer , N-F. : Extraction de PCFG et analyse de phrases pr´e-typ´ ees Recital 2012 , Grenoble ( 2012 ).

19. Younger

D.H.

: Context Free Grammar processing in n3 . ( 1968 ).

20. Zang

and Clarck

S.:

Shift-Reduce CCG Parsing . In Proceedings of the Association for Computational Linguistics ( 2011 ).

21. Sandillon-Rezer , N.F. : http://www.labri.fr/perso/nfsr/ ( 2012 ).