Generating a dictionary of control models for event extraction

Fedor Nikolaev (fsqcds@gmail.com)
Vladimir Ivanov (nomemm@gmail.com)
Kazan Federal University

Proceedings of the Tenth Spring Researcher's Colloquium on Database and Information Systems, Veliky Novgorod, Russia, 2014

Abstract

A subordination dictionary is important in a number of text processing applications. We present a method for generating such a dictionary for Russian verbs using Google Books Ngram data. The intended purpose of the dictionary is an event extraction system for Russian that uses the dictionary to define extraction patterns.

1 Introduction and Motivation

Event extraction is an important task in information extraction from unstructured text, and it has attracted a number of researchers over the last decade. An event extraction system aims at capturing certain parts of a text (e.g. event type, participants and attributes). One of the central concepts in event extraction is the trigger word (usually a separate verb) denoting the type of an event [1]. On the one hand, the trigger word indicates the presence of an event in a sentence. On the other hand, the trigger is considered the main part in the knowledge-based (KB) approach to event extraction.

According to this approach, rules (or patterns) and dictionaries are used. These patterns may be generated automatically [2] or defined manually [3]. However, in languages with free word order (e.g. Russian) a developer of such patterns should also take into account all possible arrangements of words in a sentence. In this case it is more natural to define pattern parts as independent "event-participant" pairs, which are automatically mapped to "predicate-argument" pairs denoting subordination in the parse tree of the sentence at hand. Thus a complete subordination dictionary becomes a crucial element of a knowledge-based event extraction system. A well-known limitation of recent works in this area is insufficient dictionary size, which prevents using such dictionaries in a computer system.

In 2013 Klyshinsky et al. [4] generated such a dictionary for Russian verbs using a set of web corpora that together contain about 10-11 billion tokens. The authors proposed a method for automatic generation of a dictionary for verbs and prepositions and reported a dictionary size of about 25-30 thousand verbs. Their method deals only with lexical information, i.e. extraction of verb(-preposition)-noun dependencies was done with six simple finite automata, and no parsing step was performed. Treebanks of the Russian language also have insufficient corpus size for automatic generation of a subordination dictionary that is complete for most Russian verbs. The main difference from previous works was that ambiguous parts of the text were not processed at all. The resulting set was filtered to exclude case ambiguity, infrequent words and n-grams that are not allowed by Russian grammar. The dictionary was evaluated on a corpus of Russian fiction texts and texts from a news site and showed good results.

In this paper we present an alternative method for generating a subordination dictionary using the Google Books Ngram Corpus, which contains 67 billion tokens. The main motivation behind this work is to facilitate an event extraction system for Russian that is focused on the event types described in ACE [1]. Here we consider the case when the trigger is the main verb (or predicate) that acts as the syntactic head of all participants of the corresponding event (the participants of the event act as syntactic arguments of the predicate). We start with a brief overview of a user interface that can be used for both pattern definition and dictionary correction. Then we describe the method for generating a subordination dictionary.

2 User interface for pattern and dictionary construction

For managing our dictionary we developed a user interface, shown in Figure 1, that allows one to define non-linear extraction patterns. The type of the event can be chosen from a drop-down in the top bar, and the panel below shows the argument types for that event type. There is also an interface for dealing with verbs: existing verbs can be edited and new verbs can be added. In a simple tabular interface the user can set the preposition and the grammatical case of an argument and select the participant type; a sketch of the kind of record this interface edits is given below.

Figure 1: A simple user interface for definition of event extraction patterns
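Conceptually, each row of this tabular interface is one pattern part. The following schema is only an illustrative sketch in the spirit of the SQL used later in the paper; the table name, column names and the example row are our own assumptions, not the actual schema of the system.

    -- Hypothetical table for "event-participant" pattern parts: a trigger verb,
    -- an optional preposition, the grammatical case of the argument and the
    -- participant type the argument is mapped to.
    CREATE TABLE pattern_parts (
        event_type       TEXT NOT NULL,  -- e.g. an ACE event type
        verb_lemma       TEXT NOT NULL,  -- trigger verb (predicate)
        preposition      TEXT,           -- NULL for direct (prepositionless) control
        gram_case        TEXT NOT NULL,  -- e.g. 'gent', 'datv', 'accs'
        participant_type TEXT NOT NULL   -- participant role filled by the argument
    );

    -- Purely illustrative example row (not taken from the actual dictionary).
    INSERT INTO pattern_parts
    VALUES ('Transaction', 'купить', 'за', 'accs', 'Price');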
For a few events and triggers, filling the dictionary with this application might be enough, but it becomes harder to define all the prepositions and relevant cases as the number of event types and verbs grows.

The method we propose for subordination dictionary generation is based on processing the Google Books Ngram data set. The study was carried out for Russian, but the method is applicable to other languages for which a Google Books Ngram Corpus and a morphological dictionary are available.

3 A subordination dictionary

The main idea is to use the Google Books Ngram Corpus (GBNC) enriched with morphological information and filtered with certain rules. The resulting enriched dataset has the following format:

    n1, match_count, pos, lemma, gram

where n1 is a word from the GBNC 1-gram dataset, and pos, lemma and gram stand for the POS-tag, the lemmatized word form and the vector of grammatical features, respectively (in the dataset the lemma is represented by an identifier, lemma_id). Ambiguous words lead to several records in this enriched dataset, for instance

    n1, match_count, pos, lemma_id, gramA
    n1, match_count, pos, lemma_id, gramB

where the ambiguous word n1 has two sets of grammatical features, gramA and gramB. In all such cases we omit the conflicting rows from the dataset, because taking these records into account adds a lot of noise.

3.1 Google Books Ngram Corpus

The Russian subset of the Google Books Ngram Corpus contains 67,137,666,353 tokens extracted from 591,310 volumes [6], mostly from the past three centuries. Most of the books were drawn from university libraries; each book was scanned with custom equipment and the text was digitized by means of OCR. Only n-grams that appear over 40 times across the corpus are included in the dataset.

3.2 Corpus preprocessing

The original GBNC data set contains statistics on occurrences of n-grams (n = 1...5) as well as frequencies of binary dependencies between words. These binary dependencies represent syntactic links between words in the Google Books texts. The unlabeled attachment accuracy reported in [6] for the Russian dependency parser is 86.2%.

As GBNC stores all statistics on a year-by-year basis, each data file contains tab-separated records in the following format: ngram, year, match_count, volume_count. We preprocessed the original data set in a special way. First, for each dependency 2-gram (and likewise for each 3-gram) we collected all its occurrences over the whole data set and summed all match_count values since 1900. The aggregated data set consists of (n-gram, count) pairs for n = 2, 3. This step also joins n-grams typed in different cases (lower and upper) into a single lower-case n-gram.

The next step was to assign a POS-tag and morphological features to each word in the data set. For this purpose we used the morphological dictionary provided by OpenCorpora [5]; POS-tags and morphological features were generated for 1-grams only. Both steps are sketched below.
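These two preprocessing steps can be illustrated with queries of the following kind. This is only a sketch under assumed table and column names (raw_bigrams, unigrams, morph_dict and so on), which are ours and not from the actual pipeline; in particular, folding Cyrillic upper and lower case with LOWER() depends on the database collation.

    -- Step 1 (sketch): aggregate dependency 2-grams over all years since 1900 and
    -- fold upper-/lower-case variants into a single lower-case n-gram.
    -- Assumed input: raw_bigrams(ngram, year, match_count, volume_count).
    CREATE TABLE bigrams AS
    SELECT LOWER(ngram)     AS ngram,
           SUM(match_count) AS count
    FROM raw_bigrams
    WHERE year >= 1900
    GROUP BY LOWER(ngram);

    -- Step 2 (sketch): attach POS-tags and grammatical features to 1-grams using
    -- a table derived from the OpenCorpora morphological dictionary.
    -- Assumed table: morph_dict(word_form, pos, lemma_id, gram).
    CREATE TABLE unigrams_enriched AS
    SELECT u.n1, u.match_count, m.pos, m.lemma_id, m.gram
    FROM unigrams u
    JOIN morph_dict m ON m.word_form = u.n1;

    -- As described in Section 3, word forms with more than one morphological
    -- analysis produce conflicting rows and are dropped entirely.
    DELETE FROM unigrams_enriched
    WHERE n1 IN (SELECT n1 FROM unigrams_enriched GROUP BY n1 HAVING COUNT(*) > 1);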
3.3 Construction of the dictionary of verbal models

Let us briefly describe the technique we use for generating the dictionary of direct (prepositionless) verb control. To this end we capture all (head, dep) pairs in which the POS-tag of the head is 'VERB' and the dependent part (dep) is in a certain grammatical case, say 'gent' for the Genitive. Finally, we group all these pairs by lemma_id (in order to treat different forms of the same verb together), count the number of records and sum the match_count values. Basically, we run the following SQL query against the preprocessed dataset:

    CREATE TABLE direct_verbal_control AS
    SELECT
        dep_bigrams.lemma_id,
        MIN(dep_bigrams.n1) AS n1,  -- a representative verb form; an aggregate is
                                    -- required because we group by lemma_id
        SUM(CASE WHEN dep_bigrams.gram LIKE '%nomn%'
                 THEN dep_bigrams.count ELSE 0 END) AS nomn,
        -- ... analogous SUM(CASE ...) aggregates for the remaining cases ...
        SUM(CASE WHEN dep_bigrams.gram LIKE '%loct%'
                 THEN dep_bigrams.count ELSE 0 END) AS loct
    FROM dep_bigrams
    WHERE dep_bigrams.pos = 'VERB'
    GROUP BY dep_bigrams.lemma_id;

In this example there are six aggregation (SUM) functions, one for each grammatical case (e.g. 'loct' for the Locative). Each aggregation function calculates the total number of dependency links between the verbs with a given lemma_id and arbitrary word forms in a certain grammatical case. We apply the same technique when generating the model of prepositional control from the 3-gram dataset; the queries differ only in the WHEN condition and the GROUP BY clause, which include an additional restriction on the second word of the 3-gram (see the sketch below).
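For concreteness, the prepositional-control query might look like the following sketch. It assumes a 3-gram table dep_trigrams in which n2 is the second word of the 3-gram (the candidate preposition), n2_pos its POS-tag and gram the grammatical features of the dependent word; these names are our own assumptions rather than the actual schema.

    CREATE TABLE prepositional_verbal_control AS
    SELECT
        dep_trigrams.lemma_id,
        dep_trigrams.n2 AS preposition,                 -- second word of the 3-gram
        SUM(CASE WHEN dep_trigrams.gram LIKE '%gent%'
                 THEN dep_trigrams.count ELSE 0 END) AS gent,
        -- ... analogous aggregates for the remaining grammatical cases ...
        SUM(CASE WHEN dep_trigrams.gram LIKE '%loct%'
                 THEN dep_trigrams.count ELSE 0 END) AS loct
    FROM dep_trigrams
    WHERE dep_trigrams.pos = 'VERB'
      AND dep_trigrams.n2_pos = 'PREP'                  -- restriction on the second word
                                                        -- (assumed POS tag for prepositions)
    GROUP BY dep_trigrams.lemma_id, dep_trigrams.n2;    -- extended GROUP BY, as noted above

The extended GROUP BY yields one row per verb-preposition pair, which matches the shape of the results reported in the next section.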
4 Results and future work

We ran the two types of queries described in the previous section against the whole Google Books Ngram dataset. We obtained about 24 thousand rows (one row per verb) from the dataset of dependency pairs and about 51.5 thousand rows from the dataset of 3-grams (one row per verb-preposition pair). Samples from the resulting dictionary are provided in Table 1 and Table 2.

Table 1: A part of the generated dictionary for a few frequent Russian verbs

    Verb        Main case        Genitive  Dative  Accusative  Ablative or Instrumental
    сказать     Dat.             0.183     0.573   0.057       0.133
    дать        Dat.             0.194     0.511   0.252       0.025
    говорить    Dat.             0.192     0.434   0.070       0.166
    писать      Dat.             0.207     0.389   0.174       0.123
    указать     Dat.             0.216     0.377   0.338       0.056
    изменить    Acc.             0.131     0.338   0.352       0.115
    объяснить   Ablt. or Instr.  0.093     0.292   0.113       0.489
    читать      Acc.             0.196     0.198   0.449       0.102

Table 2: Control of prepositions for the verb "купить" (to buy)

    Verb    Prep.  Main case        Genitive  Dative  Accusative  Ablt. or Instr.  Locative
    купить  для    Gent.            1.0       0.0     0.0         0.0              0.0
    купить  из     Gent.            1.0       0.0     0.0         0.0              0.0
    купить  без    Gent.            1.0       0.0     0.0         0.0              0.0
    купить  до     Gent.            1.0       0.0     0.0         0.0              0.0
    купить  с      Gent.            0.595     0.0     0.0         0.405            0.0
    купить  в      Loc.             0.0       0.011   0.068       0.0              0.921
    купить  за     Ablt. or Instr.  0.0       0.0     0.393       0.607            0.0
    купить  к      Dat.             0.0       1.0     0.0         0.0              0.0
    купить  на     Loc.             0.0       0.049   0.138       0.005            0.808
    купить  по     Dat.             0.0       1.0     0.0         0.0              0.0
    купить  под    Ablt. or Instr.  0.0       0.0     0.0         1.0              0.0
    купить  со     Ablt. or Instr.  0.0       0.0     0.0         1.0              0.0

An interesting result is that many verbs can subordinate words in almost any grammatical case. This differs significantly from the results presented in [4]. We do not consider this an error of our calculation or of the parsing method, but rather an effect of variation in the senses of a verb. It might be useful to compare our dictionary with the dictionary generated from a web corpus [4].

In our future work we will evaluate the quality of the obtained dictionary. Finally, we will use the dictionary to define a set of pattern parts (pairs) in our knowledge-based event extraction system. Those pairs will be marked with event participants manually.

References

[1] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. The automatic content extraction (ACE) program - tasks, data, and evaluation. In LREC, 2004.

[2] Daria Dzendzik and Sergey Serebryakov. Semi-automatic generation of linear event extraction patterns for free texts. In Natalia Vassilieva, Denis Turdakov, and Vladimir Ivanov, editors, SYRCoDIS, volume 1031 of CEUR Workshop Proceedings, pages 5-9. CEUR-WS.org, 2013.

[3] Valery Solovyev, Vladimir Ivanov, Rinat Gareev, Sergey Serebryakov, and Natalia Vassilieva. Methodology for building extraction templates for the Russian language in knowledge-based IE systems. 2012.

[4] E. S. Klyshinsky and N. A. Kochetkova. Method of automatic generation of Russian verb control models. In XII National Conference on Artificial Intelligence, 2013. In Russian.

[5] V. V. Bocharov, S. V. Alexeeva, D. V. Granovsky, E. V. Protopopova, M. E. Stepanova, and A. V. Surikov. Crowdsourcing morphological annotation. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (2013), Dialog '13, 2013.

[6] Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 169-174, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.