<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating a dictionary of control models for event extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Proceedings of the Tenth Spring Researcher's Colloquium</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>on Database and Information Systems</institution>
          ,
          <addr-line>Veliky Novgorod, Russia, 2014</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>A subordination dictionary is important in a number of text processing applications. We present a method for generating such a dictionary for Russian verbs using Google Books Ngram data. The intended application of the dictionary is an event extraction system for Russian that uses it to define extraction patterns.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Event extraction is an important task in information
extraction from unstructured text. The task has attracted a
number of researchers in the last decade. An event extraction
system aims at capturing certain parts of a text (e.g. event
type, participants and attributes). One of the central
concepts in event extraction is a trigger word (usually a
single verb) denoting the type of an event [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. On the one hand,
the trigger word indicates the presence of an event in a
sentence. On the other hand, the trigger is the
main element of the knowledge-based (KB) approach to event
extraction.
      </p>
      <p>
        According to this approach, rules (or patterns) and
dictionaries are used. The patterns may be generated
automatically [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or defined manually [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, in
languages with free word order (e.g. Russian) a
developer of such patterns must also take into account all
possible arrangements of words in a sentence. In this
case it is more natural to define pattern parts as
independent "event-participant" pairs, which are
automatically mapped to "predicate-argument" pairs denoting
subordination in the parse tree of the sentence at hand. Thus
a complete subordination dictionary becomes a crucial
element of a knowledge-based event extraction system.
A well-known limitation of recent work in this area is
insufficient dictionary size, which prevents using such
dictionaries in practical systems.
      </p>
      <p>
        In 2013 Klyshinsky et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] generated such a
dictionary for Russian verbs using a set of web corpora
containing about 10-11 billion tokens in total. The
authors proposed a method for the automatic generation of a
dictionary for verbs and prepositions, and reported
a dictionary size of about 25-30 thousand
verbs. Their method deals only with lexical information:
extraction of verb(-preposition)-noun dependencies
was done with six simple finite automata, and no
parsing step was performed. Treebanks of the Russian language
are also too small for the automatic
generation of a complete (covering most Russian verbs)
subordination dictionary. The main difference from previous work
is that ambiguous parts of the text were not processed at all.
The resulting set was filtered to exclude case ambiguity,
infrequent words, and n-grams that are not allowed by Russian
grammar. The dictionary was evaluated on corpora of
Russian fiction and news texts and showed
good results.
      </p>
      <p>
        In this paper we present an alternative method for
generating a subordination dictionary using the Google Books
Ngram Corpus (67 billion tokens). The main
motivation behind this work is to facilitate an event
extraction system for Russian that is focused on the event types
described in ACE [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Here we consider the case when the
trigger is the main verb (or predicate) that acts as the
syntactic head for all participants of the corresponding event
(the participants of the event act as syntactic arguments of
the predicate). We start with a brief overview of the user
interface that can be used for both pattern definition and
dictionary correction. Then we describe the method for
generating a subordination dictionary.
      </p>
    </sec>
    <sec id="sec-2">
      <title>User interface for pattern and dictionary construction</title>
      <p>For managing our dictionary we developed a user
interface, shown in Figure 1, that allows the user to define non-linear
extraction patterns. The type of the event can be chosen
from a drop-down in the top bar. The panel below shows
the argument types for the event type. There is also an interface
for dealing with verbs: existing verbs can be edited and
new verbs can be added. In a simple tabular interface the
user can set the preposition and grammatical case of the
argument and select the participant type. For a few events and
triggers, filling the dictionary through this application might
be enough, but it becomes harder to define all the
prepositions and relevant cases as the number of event types
and verbs grows.</p>
      <p>The method we propose for subordination dictionary
generation is based on processing the Google Books Ngram
data set. The study was carried out for Russian, but
the method is applicable to any other language for which
the Google Books Ngram Corpus and a morphological
dictionary are available.</p>
    </sec>
    <sec id="sec-3">
      <title>A subordination dictionary</title>
      <p>
        The main idea is to use the Google Books
Ngram Corpus (GBNC), enriched with
morphological information and filtered with certain rules.
The Russian subset of the Google Books Ngram Corpus contains
67,137,666,353 tokens extracted from 591,310 volumes
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], mostly from the past three centuries. Most of the
books were drawn from university libraries. Each book
was scanned with custom equipment and the text was
digitized by means of OCR. Only n-grams that appear
over 40 times across the corpus are included in the data set.
      </p>
      <sec id="sec-3-1">
        <title>Corpus preprocessing</title>
        <p>
          The original GBNC data set contains statistics on
occurrences of n-grams (n = 1...5) as well as frequencies of
binary dependencies between words. These binary
dependencies represent syntactic links between words in the
Google Books texts. The unlabeled attachment
accuracy of the Russian dependency parser reported in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is
86.2%.
        </p>
        <p>As GBNC stores all statistics on a year-by-year basis,
each data file contains tab-separated data in the following
format: ngram; year; match_count; volume_count.</p>
        <p>We preprocessed the original data set as follows.
First, for each dependency 2-gram (and likewise
for each 3-gram), we collected all its occurrences
over the whole data set and summed the “match_count”
values since 1900. The aggregated data set consists of pairs
(n-gram, count) for n = 2, 3. This step also joins n-grams
typed in different cases (lower and upper) into a single
(lower-case) n-gram.</p>
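The aggregation step described above can be sketched as follows. The input format and the 1900 cutoff come from the text; the function and variable names are our own illustration, not the authors' actual code:

```python
from collections import defaultdict

def aggregate_ngrams(lines, since=1900):
    """Sum match_count per lower-cased n-gram over all years >= `since`.

    Each input line follows the GBNC format:
    ngram TAB year TAB match_count TAB volume_count
    """
    counts = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if int(year) >= since:
            # Join upper- and lower-case variants into one lower-case n-gram.
            counts[ngram.lower()] += int(match_count)
    return dict(counts)

rows = [
    "Verb noun\t1899\t7\t3",   # excluded: before 1900
    "verb noun\t1950\t5\t2",
    "Verb noun\t2000\t4\t1",   # merged with the lower-case variant
]
print(aggregate_ngrams(rows))  # {'verb noun': 9}
```

A real run would stream each GBNC data file line by line instead of holding a list in memory.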
        <p>
          The next step was to assign each word in the data set
a POS-tag and morphological features. For this purpose
we used the morphological dictionary provided by
OpenCorpora [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to generate POS-tags and morphological
features for 1-grams only.
        </p>
        <p>Thus we obtained an enriched data set with the following
format: n1; match_count; pos; lemma; gram,
where n1 is a word from the GBNC 1-gram data set, and
pos, lemma and gram stand for the POS-tag, lemmatized
word form and vector of grammatical features,
respectively. Ambiguous words lead to several records in
this enriched data set. For instance,
n1; match_count; pos; lemma_id; gramA
n1; match_count; pos; lemma_id; gramB
where the ambiguous word n1 has two sets of
grammatical features: gramA and gramB. In all such cases we
omit the conflicting rows from the data set, because
taking these records into account adds a lot of noise.</p>
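The ambiguity filter can be sketched like this; the record layout mirrors the enriched data set described above, while the function name and sample rows are hypothetical:

```python
from collections import defaultdict

def drop_ambiguous(records):
    """Keep only words annotated with exactly one set of grammatical features.

    `records` is an iterable of (word, match_count, pos, lemma, gram) tuples.
    A word that occurs with two different `gram` vectors (gramA vs. gramB in
    the text) is removed entirely, since such rows mostly add noise.
    """
    grams = defaultdict(set)
    for word, _count, _pos, _lemma, gram in records:
        grams[word].add(gram)
    return [r for r in records if len(grams[r[0]]) == 1]

recs = [
    ("стекло", 10, "NOUN", "стекло", "nomn,sing"),  # "glass" (noun) ...
    ("стекло", 3, "VERB", "стечь", "past,sing"),    # ... vs "flowed down" (verb)
    ("дом", 7, "NOUN", "дом", "nomn,sing"),         # unambiguous, kept
]
print(drop_ambiguous(recs))  # keeps only the "дом" record
```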
      </sec>
      <sec id="sec-3-2">
        <title>Dictionary of verbal models construction</title>
        <p>Let us briefly describe the technique we use for
generating a dictionary of direct subject control. To this end we
capture all pairs (head, dep) where the POS-tag of the head
equals 'VERB' and the dependent part (dep) has a certain
grammatical case, say 'gent' for Genitive.
Finally, we group all these pairs by “lemma_id” (in
order to merge different forms of the same verb), count
the number of records and sum the match_count values.
Basically, we run the following SQL query against the
preprocessed data set:</p>
        <preformat>
CREATE TABLE direct_verbal_control AS
SELECT
    dep_bigrams.lemma_id,
    dep_bigrams.n1,
    SUM(CASE WHEN dep_bigrams.gram LIKE '%nomn%'
             THEN dep_bigrams.count
             ELSE 0 END) AS nomn,
    ...
    SUM(CASE WHEN dep_bigrams.gram LIKE '%loct%'
             THEN dep_bigrams.count
             ELSE 0 END) AS loct
FROM dep_bigrams
WHERE dep_bigrams.pos = 'VERB'
GROUP BY dep_bigrams.lemma_id;
        </preformat>
        <p>In this example there are six aggregation (SUM)
functions, one for each grammatical case (e.g. 'loct' for the
Locative). Each aggregation function in the query
calculates the total number of dependency links between the
verb forms sharing a lemma_id and arbitrary word forms in a
certain grammatical case. We apply the same technique
when generating the model for control of a preposition from
the 3-gram data set. The queries differ only in the
WHEN condition and the GROUP BY operator, which include an
additional restriction on the second word of a 3-gram.</p>
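The case-pivoting aggregation can be reproduced end to end with SQLite. The table and column names mirror the query in the text; the data rows (and showing only two of the six cases) are our own illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dep_bigrams (
    lemma_id TEXT, n1 TEXT, pos TEXT, gram TEXT, count INTEGER);
INSERT INTO dep_bigrams VALUES
  ('read_v', 'читает', 'VERB', 'accs', 120),
  ('read_v', 'читал',  'VERB', 'accs', 80),   -- another form, same lemma
  ('read_v', 'читает', 'VERB', 'gent', 15),
  ('fear_v', 'боится', 'VERB', 'gent', 200);
""")

# One SUM(CASE ...) per grammatical case pivots the per-case
# dependency counts into columns, grouped by verb lemma.
rows = conn.execute("""
SELECT lemma_id,
       SUM(CASE WHEN gram LIKE '%accs%' THEN count ELSE 0 END) AS accs,
       SUM(CASE WHEN gram LIKE '%gent%' THEN count ELSE 0 END) AS gent
FROM dep_bigrams
WHERE pos = 'VERB'
GROUP BY lemma_id
ORDER BY lemma_id
""").fetchall()
print(rows)  # [('fear_v', 0, 200), ('read_v', 200, 15)]
```

The GROUP BY merges the two word forms of 'read_v' into one row, which is exactly the "regard different forms of the same verb" step described above.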
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and future work</title>
      <p>We ran the two types of queries described in the previous section
against the whole Google Books Ngram data set. We obtained
about 24 thousand rows (one row per verb) from
the data set of dependency pairs and about 51.5 thousand
rows from the data set of 3-grams (one verb + preposition
per row). Samples from the resulting dictionary are
provided in Table 1 and Table 2.</p>
      <p>
        An interesting result is that many verbs can subordinate words in almost any
grammatical case. This result differs significantly from the
results presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We do not consider this an
error of our calculations or of the parsing method, but rather
an effect of variation in the sense of the verb. It might be
useful to compare our dictionary to the dictionary
generated from a web corpus [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>In our future work we will evaluate the quality of the
obtained dictionary. Finally, we will use the
dictionary to define a set of pattern parts (pairs) in our
knowledge-based event extraction system. Those pairs
will be marked with event participants manually.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>George</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Doddington</surname>
          </string-name>
          , Alexis Mitchell, Mark A.
          <string-name>
            <surname>Przybocki</surname>
          </string-name>
          , Lance A.
          <string-name>
            <surname>Ramshaw</surname>
          </string-name>
          , Stephanie Strassel, and
          <string-name>
            <surname>Ralph</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Weischedel</surname>
          </string-name>
          .
          <article-title>The automatic content extraction (ace) program - tasks, data, and evaluation</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Daria</given-names>
            <surname>Dzendzik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Serebryakov</surname>
          </string-name>
          .
          <article-title>Semiautomatic generation of linear event extraction patterns for free texts</article-title>
          . In Natalia Vassilieva, Denis Turdakov, and Vladimir Ivanov, editors,
          <source>SYRCoDIS</source>
          , volume
          <volume>1031</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <fpage>5</fpage>
          -
          <lpage>9</lpage>
          . CEUR-WS.org,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Valery</given-names>
            <surname>Solovyev</surname>
          </string-name>
          , Vladimir Ivanov, Rinat Gareev, Sergey Serebryakov, and
          <string-name>
            <given-names>Natalia</given-names>
            <surname>Vassilieva</surname>
          </string-name>
          .
          <article-title>Methodology for building extraction templates for Russian language in knowledge-based IE systems</article-title>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Klyshinsky</surname>
            <given-names>E. S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kochetkova</surname>
            <given-names>N. A.</given-names>
          </string-name>
          .
          <article-title>Method of automatic generating of russian verb control models</article-title>
          .
          <source>In XII National conference of artificial intelligence</source>
          ,
          <year>2013</year>
          . In Russian.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Granovsky</surname>
            <given-names>D. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Protopopova</surname>
            <given-names>E. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanova</surname>
            <given-names>M. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surikov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bocharov</surname>
            <given-names>V. V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Alexeeva</surname>
            <given-names>S. V.</given-names>
          </string-name>
          .
          <article-title>Crowdsourcing morphological annotation</article-title>
          .
          <source>In Computational Linguistics and Intellectual Technologies, Papers from the Annual International Conference “Dialogue”</source>
          , Dialog '
          <volume>13</volume>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yuri</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jean-Baptiste</surname>
            <given-names>Michel</given-names>
          </string-name>
          , Erez Lieberman Aiden, Jon Orwant, Will Brockman, and
          <string-name>
            <given-names>Slav</given-names>
            <surname>Petrov</surname>
          </string-name>
          .
          <article-title>Syntactic annotations for the google books ngram corpus</article-title>
          .
          <source>In Proceedings of the ACL 2012 System Demonstrations, ACL '12</source>
          , pages
          <fpage>169</fpage>
          -
          <lpage>174</lpage>
          , Stroudsburg, PA, USA,
          <year>2012</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>