Generating a dictionary of control models for event extraction

Fedor Nikolaev (fsqcds@gmail.com)
Vladimir Ivanov (nomemm@gmail.com)
Kazan Federal University

Proceedings of the Tenth Spring Researcher's Colloquium on Database and Information Systems, Veliky Novgorod, Russia, 2014

Abstract

A subordination dictionary is important in a number of text processing applications. We present a method for generating such a dictionary for Russian verbs using Google Books Ngram data. The intended purpose of the dictionary is an event extraction system for Russian that uses the dictionary to define extraction patterns.

1 Introduction and Motivation

Event extraction is an important task in information extraction from unstructured text, and it has attracted a number of researchers over the last decade. An event extraction system aims at capturing certain parts of a text (e.g. event type, participants and attributes). One of the central concepts in event extraction is the trigger word (usually a separate verb) denoting the type of an event [1]. On the one hand, the trigger word indicates the presence of an event in a sentence. On the other hand, the trigger is considered the main part in the knowledge-based (KB) approach to event extraction.

According to this approach, rules (or patterns) and dictionaries are used. These patterns may be generated automatically [2] or defined manually [3]. However, in languages with free word order (e.g. Russian) a developer of such patterns should also take into account all possible arrangements of words in a sentence. In this case it is more natural to define pattern parts as independent "event-participant" pairs, which are automatically mapped to "predicate-argument" pairs denoting subordination in the parse tree of the sentence at hand. Thus a complete subordination dictionary becomes a crucial element of a knowledge-based event extraction system. A well-known limitation of recent works in this area is insufficient dictionary size, which prevents using such dictionaries in a computer system.

In 2013 Klyshinsky et al. [4] generated such a dictionary for Russian verbs using a set of web corpora that together contain about 10-11 billion tokens. The authors proposed a method for automatic generation of a dictionary for verbs and prepositions and reported a dictionary size of about 25-30 thousand verbs. Their method deals only with lexical information, i.e. extraction of verb(-preposition)-noun dependencies was done with six simple finite automata, and no parsing step was performed. Treebanks of the Russian language also have insufficient corpus size for automatic generation of a subordination dictionary that is complete for most Russian verbs. The main difference from previous works was that ambiguous parts of the text were not processed at all. The resulting set was filtered to exclude case ambiguity, infrequent words and n-grams that are not allowed by Russian grammar. The dictionary was evaluated on a corpus of Russian fiction texts and texts from a news site and showed good results.

In this paper we present an alternative method for generating a subordination dictionary using the Google Books Ngram Corpus, which contains 67 billion tokens. The main motivation behind this work is to facilitate an event extraction system for Russian that is focused on the event types described in ACE [1]. Here we consider the case when the trigger is the main verb (or predicate) that acts as the syntactic head of all participants of the corresponding event (the participants of the event act as syntactic arguments of the predicate). We start with a brief overview of a user interface that can be used for both pattern definition and dictionary correction. Then we describe the method for generating a subordination dictionary.

2 User interface for pattern and dictionary construction

For managing our dictionary we developed a user interface, shown in Figure 1, that allows one to define non-linear extraction patterns. The type of the event can be chosen from a drop-down in the top bar, and the panel below shows the argument types for that event type. There is also an interface for dealing with verbs: existing verbs can be edited and new verbs can be added. In a simple tabular interface the user can set the preposition and the grammatical case of an argument and select the participant type; a sketch of the kind of record this interface edits is given below.

Figure 1: A simple user interface for definition of event extraction patterns
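Conceptually, each row of this tabular interface is one pattern part. The following schema is only an illustrative sketch in the spirit of the SQL used later in the paper; the table name, column names and the example row are our own assumptions, not the actual schema of the system.

    -- Hypothetical table for "event-participant" pattern parts: a trigger verb,
    -- an optional preposition, the grammatical case of the argument and the
    -- participant type the argument is mapped to.
    CREATE TABLE pattern_parts (
        event_type       TEXT NOT NULL,  -- e.g. an ACE event type
        verb_lemma       TEXT NOT NULL,  -- trigger verb (predicate)
        preposition      TEXT,           -- NULL for direct (prepositionless) control
        gram_case        TEXT NOT NULL,  -- e.g. 'gent', 'datv', 'accs'
        participant_type TEXT NOT NULL   -- participant role filled by the argument
    );

    -- Purely illustrative example row (not taken from the actual dictionary).
    INSERT INTO pattern_parts
    VALUES ('Transaction', 'купить', 'за', 'accs', 'Price');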
For a few events and triggers, filling the dictionary with this application might be enough, but it becomes harder to define all the prepositions and relevant cases as the number of event types and verbs grows.

The method we propose for subordination dictionary generation is based on processing the Google Books Ngram data set. The study was carried out for Russian, but the method is applicable to other languages for which a Google Books Ngram Corpus and a morphological dictionary are available.

3 A subordination dictionary

The main idea is to use the Google Books Ngram Corpus (GBNC) enriched with morphological information and filtered with certain rules. The resulting enriched dataset has the following format:

    n1, match_count, pos, lemma, gram

where n1 is a word from the GBNC 1-gram dataset, and pos, lemma and gram stand for the POS-tag, the lemmatized word form and the vector of grammatical features, respectively (in the dataset the lemma is represented by an identifier, lemma_id). Ambiguous words lead to several records in this enriched dataset, for instance

    n1, match_count, pos, lemma_id, gramA
    n1, match_count, pos, lemma_id, gramB

where the ambiguous word n1 has two sets of grammatical features, gramA and gramB. In all such cases we omit the conflicting rows from the dataset, because taking these records into account adds a lot of noise.

3.1 Google Books Ngram Corpus

The Russian subset of the Google Books Ngram Corpus contains 67,137,666,353 tokens extracted from 591,310 volumes [6], mostly from the past three centuries. Most of the books were drawn from university libraries; each book was scanned with custom equipment and the text was digitized by means of OCR. Only n-grams that appear over 40 times across the corpus are included in the dataset.

3.2 Corpus preprocessing

The original GBNC data set contains statistics on occurrences of n-grams (n = 1...5) as well as frequencies of binary dependencies between words. These binary dependencies represent syntactic links between words in the Google Books texts. The unlabeled attachment accuracy reported in [6] for the Russian dependency parser is 86.2%.

As GBNC stores all statistics on a year-by-year basis, each data file contains tab-separated records in the following format: ngram, year, match_count, volume_count. We preprocessed the original data set in a special way. First, for each dependency 2-gram (and likewise for each 3-gram) we collected all its occurrences over the whole data set and summed all match_count values since 1900. The aggregated data set consists of (n-gram, count) pairs for n = 2, 3. This step also joins n-grams typed in different cases (lower and upper) into a single lower-case n-gram.

The next step was to assign a POS-tag and morphological features to each word in the data set. For this purpose we used the morphological dictionary provided by OpenCorpora [5]; POS-tags and morphological features were generated for 1-grams only. Both steps are sketched below.
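These two preprocessing steps can be illustrated with queries of the following kind. This is only a sketch under assumed table and column names (raw_bigrams, unigrams, morph_dict and so on), which are ours and not from the actual pipeline; in particular, folding Cyrillic upper and lower case with LOWER() depends on the database collation.

    -- Step 1 (sketch): aggregate dependency 2-grams over all years since 1900 and
    -- fold upper-/lower-case variants into a single lower-case n-gram.
    -- Assumed input: raw_bigrams(ngram, year, match_count, volume_count).
    CREATE TABLE bigrams AS
    SELECT LOWER(ngram)     AS ngram,
           SUM(match_count) AS count
    FROM raw_bigrams
    WHERE year >= 1900
    GROUP BY LOWER(ngram);

    -- Step 2 (sketch): attach POS-tags and grammatical features to 1-grams using
    -- a table derived from the OpenCorpora morphological dictionary.
    -- Assumed table: morph_dict(word_form, pos, lemma_id, gram).
    CREATE TABLE unigrams_enriched AS
    SELECT u.n1, u.match_count, m.pos, m.lemma_id, m.gram
    FROM unigrams u
    JOIN morph_dict m ON m.word_form = u.n1;

    -- As described in Section 3, word forms with more than one morphological
    -- analysis produce conflicting rows and are dropped entirely.
    DELETE FROM unigrams_enriched
    WHERE n1 IN (SELECT n1 FROM unigrams_enriched GROUP BY n1 HAVING COUNT(*) > 1);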
3.3 Construction of the dictionary of verbal models

Let us briefly describe the technique we use for generating the dictionary of direct (prepositionless) verb control. To this end we capture all (head, dep) pairs in which the POS-tag of the head is 'VERB' and the dependent part (dep) is in a certain grammatical case, say 'gent' for the Genitive. Finally, we group all these pairs by lemma_id (in order to treat different forms of the same verb together), count the number of records and sum the match_count values. Basically, we run the following SQL query against the preprocessed dataset:

    CREATE TABLE direct_verbal_control AS
    SELECT
        dep_bigrams.lemma_id,
        MIN(dep_bigrams.n1) AS n1,  -- a representative verb form; an aggregate is
                                    -- required because we group by lemma_id
        SUM(CASE WHEN dep_bigrams.gram LIKE '%nomn%'
                 THEN dep_bigrams.count ELSE 0 END) AS nomn,
        -- ... analogous SUM(CASE ...) aggregates for the remaining cases ...
        SUM(CASE WHEN dep_bigrams.gram LIKE '%loct%'
                 THEN dep_bigrams.count ELSE 0 END) AS loct
    FROM dep_bigrams
    WHERE dep_bigrams.pos = 'VERB'
    GROUP BY dep_bigrams.lemma_id;

In this example there are six aggregation (SUM) functions, one for each grammatical case (e.g. 'loct' for the Locative). Each aggregation function calculates the total number of dependency links between the verbs with a given lemma_id and arbitrary word forms in a certain grammatical case. We apply the same technique when generating the model of prepositional control from the 3-gram dataset; the queries differ only in the WHEN condition and the GROUP BY clause, which include an additional restriction on the second word of the 3-gram (see the sketch below).
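For concreteness, the prepositional-control query might look like the following sketch. It assumes a 3-gram table dep_trigrams in which n2 is the second word of the 3-gram (the candidate preposition), n2_pos its POS-tag and gram the grammatical features of the dependent word; these names are our own assumptions rather than the actual schema.

    CREATE TABLE prepositional_verbal_control AS
    SELECT
        dep_trigrams.lemma_id,
        dep_trigrams.n2 AS preposition,                 -- second word of the 3-gram
        SUM(CASE WHEN dep_trigrams.gram LIKE '%gent%'
                 THEN dep_trigrams.count ELSE 0 END) AS gent,
        -- ... analogous aggregates for the remaining grammatical cases ...
        SUM(CASE WHEN dep_trigrams.gram LIKE '%loct%'
                 THEN dep_trigrams.count ELSE 0 END) AS loct
    FROM dep_trigrams
    WHERE dep_trigrams.pos = 'VERB'
      AND dep_trigrams.n2_pos = 'PREP'                  -- restriction on the second word
                                                        -- (assumed POS tag for prepositions)
    GROUP BY dep_trigrams.lemma_id, dep_trigrams.n2;    -- extended GROUP BY, as noted above

The extended GROUP BY yields one row per verb-preposition pair, which matches the shape of the results reported in the next section.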
4 Results and future work

We ran the two types of queries described in the previous section against the whole Google Books Ngram dataset. We obtained about 24 thousand rows (one row per verb) from the dataset of dependency pairs and about 51.5 thousand rows from the dataset of 3-grams (one row per verb-preposition pair). Samples from the resulting dictionary are provided in Table 1 and Table 2.

Table 1: A part of the generated dictionary for a few frequent Russian verbs

    Verb        Main case        Genitive  Dative  Accusative  Ablative or Instrumental
    сказать     Dat.             0.183     0.573   0.057       0.133
    дать        Dat.             0.194     0.511   0.252       0.025
    говорить    Dat.             0.192     0.434   0.070       0.166
    писать      Dat.             0.207     0.389   0.174       0.123
    указать     Dat.             0.216     0.377   0.338       0.056
    изменить    Acc.             0.131     0.338   0.352       0.115
    объяснить   Ablt. or Instr.  0.093     0.292   0.113       0.489
    читать      Acc.             0.196     0.198   0.449       0.102

Table 2: Control of prepositions for the verb "купить" (to buy)

    Verb    Prep.  Main case        Genitive  Dative  Accusative  Ablt. or Instr.  Locative
    купить  для    Gent.            1.0       0.0     0.0         0.0              0.0
    купить  из     Gent.            1.0       0.0     0.0         0.0              0.0
    купить  без    Gent.            1.0       0.0     0.0         0.0              0.0
    купить  до     Gent.            1.0       0.0     0.0         0.0              0.0
    купить  с      Gent.            0.595     0.0     0.0         0.405            0.0
    купить  в      Loc.             0.0       0.011   0.068       0.0              0.921
    купить  за     Ablt. or Instr.  0.0       0.0     0.393       0.607            0.0
    купить  к      Dat.             0.0       1.0     0.0         0.0              0.0
    купить  на     Loc.             0.0       0.049   0.138       0.005            0.808
    купить  по     Dat.             0.0       1.0     0.0         0.0              0.0
    купить  под    Ablt. or Instr.  0.0       0.0     0.0         1.0              0.0
    купить  со     Ablt. or Instr.  0.0       0.0     0.0         1.0              0.0

An interesting result is that many verbs can subordinate words in almost any grammatical case. This differs significantly from the results presented in [4]. We do not consider this an error of our calculation or of the parsing method, but rather an effect of variation in the senses of a verb. It might be useful to compare our dictionary with the dictionary generated from a web corpus [4].

In our future work we will evaluate the quality of the obtained dictionary. Finally, we will use the dictionary to define a set of pattern parts (pairs) in our knowledge-based event extraction system. Those pairs will be marked with event participants manually.

References

[1] George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. The automatic content extraction (ACE) program - tasks, data, and evaluation. In LREC, 2004.

[2] Daria Dzendzik and Sergey Serebryakov. Semi-automatic generation of linear event extraction patterns for free texts. In Natalia Vassilieva, Denis Turdakov, and Vladimir Ivanov, editors, SYRCoDIS, volume 1031 of CEUR Workshop Proceedings, pages 5-9. CEUR-WS.org, 2013.

[3] Valery Solovyev, Vladimir Ivanov, Rinat Gareev, Sergey Serebryakov, and Natalia Vassilieva. Methodology for building extraction templates for the Russian language in knowledge-based IE systems. 2012.

[4] E. S. Klyshinsky and N. A. Kochetkova. Method of automatic generation of Russian verb control models. In XII National Conference on Artificial Intelligence, 2013. In Russian.

[5] V. V. Bocharov, S. V. Alexeeva, D. V. Granovsky, E. V. Protopopova, M. E. Stepanova, and A. V. Surikov. Crowdsourcing morphological annotation. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (2013), Dialog '13, 2013.

[6] Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. Syntactic annotations for the Google Books Ngram corpus. In Proceedings of the ACL 2012 System Demonstrations, ACL '12, pages 169-174, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.