Semi-Automatic Generation of Linear Event Extraction Patterns for Free Texts

Daria Dzendzik (HP Labs, daria.dzendzik@hp.com)
Sergey Serebryakov (HP Labs, sergey.serebryakov@hp.com)

Proceedings of the Ninth Spring Researcher's Colloquium on Database and Information Systems, Kazan, Russia, 2013

Abstract

In this paper we describe a semi-automatic approach to generating event extraction patterns for free texts. The algorithm is composed of four steps: we automatically extract possible events from a corpus of free documents, cluster them using dependency-based parse tree paths, validate random samples from each cluster and generate linear patterns using positive event clusters. We compare our algorithm with a system that uses manually created patterns.

1 Introduction

The event extraction (EE) task is formulated as "to automatically identify the events in free texts and derive detailed information about them, ideally identifying who did what to whom, when, with what methods, where and why". Events involve entities and relations between them and imply a change of state. For instance, the sentence "Palm was acquired by Hewlett-Packard for $1.2 billion two days ago" mentions an event of the type Mergers & Acquisitions (M&A) with four arguments: acquirer (Hewlett-Packard), acquiree (Palm), a monetary expression ($1.2 billion) and a temporal expression (two days ago). There is also an event indicator (was acquired). An event indicator is a word or a sequence of words that clearly signals the possible presence of an event.

EE has been the subject of active research for more than twenty years, since the series of MUC conferences started in 1987. Initially, the EE task was limited to several domains and had small-sized corpora at its disposal. Nowadays, with the rapid evolution of the socially-oriented Internet, it is becoming a crucial task for businesses to analyze millions of documents with multilingual content each day with minimal latency in order to gain insight and make critical decisions in time. We believe that modern and effective solutions to the EE task would allow businesses to dramatically reduce the time required to start making use of new sources of information and to reduce the operating costs of such systems.

One of the popular approaches to information extraction (IE), and to EE as a sub-problem of IE in particular, is the use of domain-specific extraction patterns, such as linear extraction patterns for annotation graphs and patterns for dependency-based parse tree (DPT) structures. In this paper we present our approach to semi-automatic generation of linear extraction patterns for annotation graphs that uses DPT patterns in the process of constructing the former.

The paper is structured as follows. In the next section we briefly outline related work; in particular, we give an overview of papers that propose algorithms minimizing the human effort required to build extraction patterns. In section 3 we describe the four steps of the algorithm in detail. Section 4 presents the results of experiments. We conclude in section 5 by summarizing the results and outlining current and future work.

2 Related Work

One of the best-known algorithms that learn multi-slot extraction patterns for free texts is Whisk [10]. It requires an annotated corpus of documents together with syntactic parse data.

The problem of learning conventional character-based regular expressions from annotated corpora is described in [8]. The authors demonstrate the applicability of their method to extracting such structured entities as software names, URLs, phone and course numbers. It remains an open question whether their approach can be effectively applied to extracting higher-level structures such as relations or events.

Snowball [1, 3, 9] and ExDisco [3, 11] are examples of algorithms for learning information extraction patterns from unannotated corpora. Snowball is a logical successor of DIPRE [4, 1, 3, 9], which was intended for extracting binary relations from semi-structured web data. Snowball also extracts binary relations, but it is intended for free texts.
The main idea is that there is a small seed of known relations, and occurrences of these relations are first found in the text. After that, the algorithm generalizes the sentence context (creates a pattern). The next step is to find new relations within this context.

ExDisco uses a different approach to deal with the absence of annotated corpora. It divides a corpus into two parts (relevant and irrelevant documents) and relies on the following assumption: relevant documents contain relevant events, and relevant events are present in relevant documents. ExDisco uses a small seed of 2 or 3 known patterns to first divide the documents; it then performs syntactic analysis of every sentence and uses the results for pattern generation.

As opposed to conventional IE from relatively small annotated corpora, Open Information Extraction (OIE) deals with large Web-sized unannotated sets of documents. The two best-known information extraction systems that do OIE are KnowItAll [6, 5, 3] and TextRunner [12, 5, 3]. They rely upon syntactic information rather than annotated or any lexical data.

3 Generating Linear Patterns

The motivation behind this research is to minimize the human effort in constructing event extraction patterns for new types of events. There are three assumptions underlying our algorithm.

The first one concerns how we define the verity of events. We define an event to be a true (positive) event if its indicator and arguments linguistically represent a valid event mention, regardless of the polarity, modality, etc. of that mention. For instance, in the sentence "Reuters announces that HP will not acquire Palm" the pair {Reuters, announces} represents a valid event mention of the type Company Announcement, while the pair {HP, announces} does not. On the other hand, the triple {HP, Palm, will not acquire} represents a valid event mention of the type M&A, while the triple {Palm, Reuters, will not acquire} does not.

Second, the algorithm requires the documents to be annotated with named entities that might be the arguments of events. These include not only such named entities as persons, companies and positions, and event indicators, but also temporal and monetary expressions. The complete set of required named entities depends on the types of events to be extracted. We use available NERs (in particular, the OpenNLP library) which we trust. This means that we consider texts annotated by the appropriate NERs as the gold standard, and we do not take into account the confidence of extracted entities (if the NERs provide such information) when evaluating the extraction patterns.

Third, we assume that if two events are extracted by the same DPT pattern, they are both either true or false. More generally, a group of N events all extracted by the same DPT pattern contains only true or only false events. The verity of the group is determined by the verity of a random sample drawn from it. This is the main assumption that allows us to minimize the human effort required to produce extraction rules: the assessor validates only a small random sample rather than the entire group, which might contain tens of thousands of events.

As we have mentioned, the algorithm is composed of four steps. At the first step we extract possible events from an unannotated corpus of documents.

3.1 Extracting Possible Events

Any event type is described by an indicator (a word that most clearly signals the event presence, usually a verb) and by arguments together with their semantic types. For instance, an event of the type M&A is described by an event indicator of the type AcquisitionIndicator and two arguments: acquirer of the type Company and acquiree, also of the type Company. Another example is an event of the type PersonAnnouncement, which is described by an indicator of the type AnnouncementIndicator and one argument announcer of the type Person. Formally, an event is defined as the following triple:

    E = <T, I, {(A_i^N, A_i^T)}_{i=1}^m>    (1)

where T is the event type, I is the type of the event indicator, the event has m arguments, and the i-th argument has a name A_i^N and a type A_i^T.
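To make the definition concrete, the sketch below shows one way the event model of (1) could be represented in Java. All class and field names here are our own illustrative assumptions, not the authors' implementation.

    import java.util.List;

    public class EventModel {
        // The i-th argument has a name A_i^N and a semantic type A_i^T.
        record Argument(String name, String type) {}

        // An event type T with an indicator type I and m typed arguments.
        record EventType(String type, String indicatorType, List<Argument> arguments) {}

        public static void main(String[] args) {
            // The M&A event from the running example: an AcquisitionIndicator
            // plus two Company-typed arguments (acquirer and acquiree).
            EventType mna = new EventType("M&A", "AcquisitionIndicator",
                    List.of(new Argument("acquirer", "Company"),
                            new Argument("acquiree", "Company")));

            // PersonAnnouncement: an AnnouncementIndicator and one Person argument.
            EventType announcement = new EventType("PersonAnnouncement",
                    "AnnouncementIndicator",
                    List.of(new Argument("announcer", "Person")));

            System.out.println(mna);
            System.out.println(announcement);
        }
    }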
The core idea of the first step is to generate all possible events for every sentence in the corpus. To do this, we have built the processing pipeline presented in fig. 1. The primary task of this pipeline is to recognize instances of event indicators and named entities that might be arguments of events.

Figure 1: Pipeline that extracts possible events.

The pipeline has a typical architecture intended for extracting named entities. Initially, it splits text into sentences and tokens. After that, it uses Named Entity and Event Indicator Recognizers to extract named entities and event indicators. We use the OpenNLP library to extract named entities and a dictionary-based recognizer to extract event indicators.

The last component of the pipeline (Possible Event Extractor) uses the previously extracted information together with the description of events in order to extract possible events. First, this component determines whether a sentence may contain at least one event of any type. To do this, it iterates over the descriptions of event types. For each event type, it determines whether the sentence contains (1) an indicator of the appropriate type and, if so, (2) a minimal number of the appropriate named entities. For instance, for an event of the type PersonAnnouncement a sentence must contain at least one indicator of the type AnnouncementIndicator and at least one named entity of the type Person. For an event of the type M&A, a sentence must contain at least one indicator of the type AcquisitionIndicator and at least two named entities of the type Company. If a sentence cannot contain a mention of at least one event, it is not processed further.

If there can be at least one event mention in the sentence, we apply the Stanford parser to obtain the DPT. For every event type T for which there can be at least one event mention, we generate all possible events. First, we construct the list of all appropriate event indicators {I_i}_{i=1}^n found in the sentence. Then we construct the set of named entities that can be part of possible events. We compute the shortest paths in the DPT from the indicators to every named entity in this set. We then generate all possible events around each event indicator.

For every possible event, we construct its pattern id, a non-unique string that is composed of the event type information and the DPT paths. Two events of the same type having the same paths from indicators to arguments will have the same pattern id. The format of the pattern id is the following: Indicator Attribute1Type(CoveredText1):Path1, Attribute2Type(CoveredText2):Path2, ... For instance, the event mentioned in the sentence "Google acquires Neotonic Software" has the pattern id AcquisitionIndicator Company(Google):nsubj, Company(Neotonic Software):dobj.
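As an illustration, the following sketch assembles a pattern id of this shape from an indicator type and a list of arguments. The Arg record and its fields are hypothetical names we introduce for the example, not the system's actual types.

    import java.util.List;
    import java.util.stream.Collectors;

    public class PatternId {
        // An extracted argument: its semantic type, surface text and DPT path
        // from the indicator (an illustrative assumption about the data model).
        record Arg(String type, String coveredText, String path) {}

        // Builds a pattern id string in the format described above.
        static String patternId(String indicatorType, List<Arg> args) {
            return indicatorType + " " + args.stream()
                    .map(a -> a.type() + "(" + a.coveredText() + "):" + a.path())
                    .collect(Collectors.joining(", "));
        }

        public static void main(String[] args) {
            // Prints: AcquisitionIndicator Company(Google):nsubj,
            //         Company(Neotonic Software):dobj
            System.out.println(patternId("AcquisitionIndicator",
                    List.of(new Arg("Company", "Google", "nsubj"),
                            new Arg("Company", "Neotonic Software", "dobj"))));
        }
    }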
It is possible to count the number N of events that are generated around each indicator. Let us denote the number of distinct argument types as m'; for each j-th distinct type, let S_j be the number of arguments of this type in the event definition and N_j the number of named entities of this type in the sentence. Then there are P_j(N_j, S_j) = N_j! / (N_j - S_j)! variants of filling S_j arguments with N_j named entities. To compute the number of possible events, we multiply these values over all distinct types: N = prod_{j=1}^{m'} P_j(N_j, S_j). Due to limitations of the Stanford parser, only sentences that have fewer than 80 tokens are processed.
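For example, a sentence with three Company mentions yields P(3, 2) = 3!/1! = 6 possible M&A events around each AcquisitionIndicator. The following sketch (method names are ours) computes N:

    public class EventCount {
        // Number of ordered ways to fill s argument slots with n entities
        // of the same type: P(n, s) = n! / (n - s)!.
        static long permutations(int n, int s) {
            long p = 1;
            for (int i = 0; i < s; i++) p *= (n - i);
            return p;
        }

        // N = product over the m' distinct argument types of P(N_j, S_j).
        static long possibleEvents(int[] entitiesPerType, int[] slotsPerType) {
            long n = 1;
            for (int j = 0; j < entitiesPerType.length; j++)
                n *= permutations(entitiesPerType[j], slotsPerType[j]);
            return n;
        }

        public static void main(String[] args) {
            // M&A in a sentence with three Company mentions:
            // one distinct type, N_1 = 3, S_1 = 2 -> 3!/1! = 6 possible events.
            System.out.println(possibleEvents(new int[]{3}, new int[]{2})); // 6
        }
    }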
3.2 Grouping Possible Events into Clusters

Once the input corpus has been processed and all possible events have been identified, at the second step we group the events according to their pattern id. All events inside a group have the same pattern id; in other words, they are all extracted by the same DPT pattern. We sort the groups by cardinality and generate a random subset of each group.

3.3 Assessor Validation

At the third step an assessor validates a small random subset of each cluster using the UIMA CAS Editor (http://uima.apache.org/) inside the Eclipse IDE. This editor (fig. 2) highlights text fragments that mention possible events. For every possible event, the tool provides detailed information such as its indicator and arguments, which are also highlighted in the text. The assessor iterates through every possible event, decides whether a particular event is true or false and, in case it is true, sets its verity flag to true.

Figure 2: UIMA CAS Editor perspective (Eclipse IDE).

3.4 Generating Linear Patterns

Finally, the system annotates the groups of events as positive or negative based on the subsets validated by the assessor at the previous step. This is done by counting the number of true and the number of false events in the subset. If there are more positive events in the subset, the whole group is annotated as positive, and vice versa. Positive groups are then used to generate and generalize event extraction patterns.

Figure 3: Typesystem of the TextMARKER rule engine.

The input data for the pattern generation algorithm is the sentence annotation graph. The graph contains such annotations as events, their indicators and arguments (named entities), and the sentence context, which is represented as types of tokens in the notation of the TextMARKER [7] type system (fig. 3). We use a bottom-up generalization strategy: we start with the most specific version of a pattern and generalize it as long as the F1-measure of the new patterns increases.

    1:  procedure GeneralizeRule(rule)
    2:      curF1 <- 0
    3:      newF1 <- TestRule(rule)
    4:      while (newF1 - curF1) > 0 do
    5:          curF1 <- newF1
    6:          rules <- empty list
    7:          rules.add(BestRule(BottomUpGen(rule)))
    8:          rules.add(BestRule(QuantModif(rule)))
    9:          if NewRulesUnknown(rules) then
    10:             newRule <- BestRule(rules)
    11:             newF1 <- TestRule(newRule)
    12:             rule <- newRule
    13:         end if
    14:     end while
    15:     return rule
    16: end procedure

Figure 4: Pattern generalization algorithm.

There are two operations that we use during the pattern generalization process (fig. 4):

1. BottomUpGen: bottom-up generalization (see fig. 3):

   (a) generalizing a word to its type: "recently" -> SW; "May" -> CW; "INC" -> CAP
   (b) generalizing the type of a word to its immediate parent type: SW, CW, CAP -> W
   (c) generalizing the types of punctuation marks to their parent type: COLON, COMMA, SEMICOLON -> PM
   (d) generalizing the types to their parent type: W, PM -> ANY

   After the type of a token has been changed, its neighbors are visited. If two nearby tokens have the same type, they are generalized to one token of that type with an expanded quantifier:

   "company" "based" "in" "Tel" "Aviv" -> SW SW SW CW CW -> SW[3] CW[2] -> W[5]

2. QuantModif: quantifier modification (expansion or restriction of a quantifier):

   (a) Expansion: SW[3] -> SW[2,4]
   (b) Restriction: CW[2,7] -> CW[2,6]

Using these two operations, we iteratively generalize the patterns. At every step we select the best pattern, and we stop the generalization process when the F1-measure (newF1) stops improving. We keep a list of previously generalized patterns; if the current pattern has already been processed, it is not generalized further. This corresponds to line 9 of the algorithm (fig. 4).
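The sketch below is a rough Java transcription of fig. 4. The RuleOps callbacks standing in for TestRule, BottomUpGen and QuantModif are our own assumed interface; the toy implementation in main only demonstrates the control flow, while the real operations work on TextMARKER rules evaluated against the validated groups.

    import java.util.*;

    public class Generalizer {
        interface RuleOps {
            double testRule(String rule);           // F1 of a rule on the validated groups
            List<String> bottomUpGen(String rule);  // candidates from type generalization
            List<String> quantModif(String rule);   // candidates from quantifier changes
        }

        // Transcription of the procedure in fig. 4: greedily apply the two
        // operations while the F1-measure keeps improving, skipping rules
        // that have already been visited (line 9 of the algorithm).
        static String generalizeRule(String rule, RuleOps ops) {
            Set<String> seen = new HashSet<>(List.of(rule));
            double curF1 = 0, newF1 = ops.testRule(rule);
            while (newF1 - curF1 > 0) {
                curF1 = newF1;
                List<String> rules = new ArrayList<>();
                rules.add(bestRule(ops.bottomUpGen(rule), ops));
                rules.add(bestRule(ops.quantModif(rule), ops));
                rules.removeIf(r -> r == null || !seen.add(r)); // keep only unseen rules
                if (rules.isEmpty()) break;                     // no unknown candidates left
                String newRule = bestRule(rules, ops);
                newF1 = ops.testRule(newRule);
                rule = newRule;
            }
            return rule;
        }

        // Selects the candidate with the highest F1-measure.
        static String bestRule(List<String> candidates, RuleOps ops) {
            return candidates.stream()
                    .max(Comparator.comparingDouble(ops::testRule))
                    .orElse(null);
        }

        public static void main(String[] args) {
            // Toy RuleOps: shorter rules score higher, mimicking generalization
            // (for demonstration only; real F1 comes from the validated groups).
            RuleOps toy = new RuleOps() {
                public double testRule(String r) { return 1.0 / r.length(); }
                public List<String> bottomUpGen(String r) {
                    return r.length() > 1 ? List.of(r.substring(1)) : List.of(r);
                }
                public List<String> quantModif(String r) { return List.of(r); }
            };
            System.out.println(generalizeRule("SW SW SW CW CW", toy));
        }
    }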
4 Experiments

Sentences annotated by the assessor with true events are considered the gold standard and are used to measure the performance of event extraction patterns. Once the generated patterns have been applied over the gold standard, we compare the gold event annotations with the event annotations created by the patterns under evaluation. Two event annotations are considered the same if they cover the same event indicators and arguments (named entities).
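Under this matching criterion, precision, recall and F1-measure reduce to set comparisons between gold and predicted annotations. A minimal sketch, with assumed span keys (the actual matching works on indicator and argument annotations in the CAS):

    import java.util.*;

    public class Evaluation {
        // An event annotation is identified by its indicator span and the
        // spans of its arguments (illustrative keys; offsets stand in for spans).
        record EventAnnotation(String indicator, Set<String> argumentSpans) {}

        // Two annotations are the same iff they cover the same indicator and
        // the same arguments, so set intersection gives the true positives.
        static double[] score(Set<EventAnnotation> gold, Set<EventAnnotation> predicted) {
            Set<EventAnnotation> tp = new HashSet<>(predicted);
            tp.retainAll(gold);
            double precision = predicted.isEmpty() ? 0 : (double) tp.size() / predicted.size();
            double recall = gold.isEmpty() ? 0 : (double) tp.size() / gold.size();
            double f1 = (precision + recall == 0) ? 0
                    : 2 * precision * recall / (precision + recall);
            return new double[]{precision, recall, f1};
        }

        public static void main(String[] args) {
            EventAnnotation e1 = new EventAnnotation("acquires@7-15",
                    Set.of("Google@0-6", "Neotonic Software@16-33"));
            EventAnnotation e2 = new EventAnnotation("resigned@40-48", Set.of("J. Smith@30-38"));
            double[] prf = score(Set.of(e1, e2), Set.of(e1));
            System.out.printf("P=%.2f R=%.2f F1=%.2f%n", prf[0], prf[1], prf[2]);
            // P=1.00 R=0.50 F1=0.67
        }
    }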
We ran preliminary experiments on the English part of the Reuters RCV2 dataset [2]. (Initially, we wanted to use RCV1, which is much bigger; however, due to the execution-time characteristics of the first implementation of the algorithm, we decided to use the smaller corpus and to experiment on RCV1 after we optimize the algorithm.) Table 1 presents the details of the entire dataset that we used (after processing all English articles with the pipeline that detects possible events).

Table 1: Dataset description

    type of event           M&A   MPC   Res
    number of sentences      83    45    80
    number of true events    12    16    25

We ran experiments on extracting three types of events: Mergers & Acquisitions (M&A), Management Position Change (MPC) and Resignation (Res). We compare the performance of automatically constructed patterns with patterns that we built manually based on RSS articles obtained from such sources as Yahoo and Google news feeds. In total, we manually created 10 patterns for M&A, 3 patterns for MPC and 5 patterns for Res events.

In the first experiment, we created 2 train/test splits. In each split, we constructed rules on the train set and validated them on the test set. Table 2 provides the split details as well as the number of patterns constructed automatically in each case. Tables 3-5 give an overview of the results.

Table 2: Splits of the entire dataset (train/test (patterns#))

            1st split     2nd split
    M&A     61/22 (8)     55/28 (8)
    MPC     38/7 (9)      30/15 (8)
    Res     54/26 (10)    56/24 (10)

For M&A events, automatically constructed patterns outperformed manually created ones in terms of precision, recall and F1-measure on both the train and test sets in all splits (table 3), even though the number of automatically constructed patterns (8) is smaller than the number of manually created ones (10). Note that in the first split on the test data and in the second split on the train data the manually constructed patterns did not extract any events, which is why we do not report performance for these cases.

Table 3: M&A patterns performance (train / test)

                      Manually      Automatically
    1st split
      Precision       0.50 / --     1.00 / 1.00
      Recall          0.11 / 0.00   1.00 / 0.33
      F1-measure      0.18 / --     1.00 / 0.50
    2nd split
      Precision       -- / 0.50     1.00 / 1.00
      Recall          0.00 / 0.50   1.00 / 0.50
      F1-measure      -- / 0.50     1.00 / 0.66

In the case of MPC events, in the first split we obtained higher results with automatically constructed patterns (see table 4). In the second split, manually and automatically constructed patterns demonstrated the same results; however, our algorithm constructed 8 patterns as opposed to 3 patterns constructed by a human.

Table 4: MPC patterns performance (train / test)

                      Manually      Automatically
    1st split
      Precision       0.78 / 0.50   0.92 / 1.00
      Recall          0.58 / 0.33   1.00 / 0.33
      F1-measure      0.67 / 0.40   0.96 / 0.50
    2nd split
      Precision       0.67 / 1.00   1.00 / 1.00
      Recall          0.60 / 0.40   1.00 / 0.40
      F1-measure      0.63 / 0.57   1.00 / 0.57

Automatically constructed patterns for Res events also outperformed the manually created ones in terms of F1-measure in both splits. However, the precision of the manually constructed patterns was higher in both splits.

Table 5: Res patterns performance (train / test)

                      Manually      Automatically
    1st split
      Precision       1.00 / 1.00   0.83 / 0.50
      Recall          0.18 / 0.33   0.86 / 0.67
      F1-measure      0.30 / 0.50   0.84 / 0.57
    2nd split
      Precision       1.00 / 1.00   0.90 / 0.33
      Recall          0.19 / 0.25   0.90 / 0.75
      F1-measure      0.32 / 0.40   0.90 / 0.46

In the second experiment, we aimed at a more realistic comparison of manually and automatically constructed patterns. We validated the manually constructed patterns on the entire dataset (table 7) and ran the algorithm through a 5-fold cross-validation repeated 5 times (since the dataset we used is relatively small); these results are presented in table 6. As can be seen, we obtained quite varying results for different partitions (F1-measure: 0.4523-0.8333 for M&A, 0.5428-0.7023 for MPC and 0.5000-0.6764 for Res events). The conclusion we draw from these results is that we need a larger and more representative dataset to construct and evaluate comprehensive event extraction patterns.

Table 6: F1-measure of automatically constructed rules (5-fold cross-validation, 5 runs)

    M&A      MPC      Res
    0.4523   0.5428   0.5000
    0.4666   0.5714   0.5988
    0.4888   0.6750   0.6321
    0.5555   0.6904   0.6426
    0.8333   0.7023   0.6764

Table 7: Performance of manually constructed rules on the entire dataset

                  M&A    MPC    Res
    Precision     0.50   0.73   1.00
    Recall        0.08   0.53   0.20
    F1-measure    0.14   0.62   0.33

5 Conclusions and Future Work

In this paper we gave an overview of the current progress in developing an algorithm for semi-automatic generation of linear event extraction patterns for free texts and presented our preliminary experimental results. Our next steps are to enhance the algorithm with additional capabilities, to optimize it in terms of execution performance and to validate it on a larger dataset.

Currently, we extract only mandatory event arguments. We will add the capability to generalize patterns that include optional arguments as well (such as temporal and monetary expressions). We plan to explore ways to improve and optimize the algorithm. We will work on enhancing the generalization algorithm by providing additional operations as well as an intelligent selection of the operations to apply at each iteration. We will also explore the possibility of employing negative examples. We have a dataset of approximately 110,000 news articles that we will use for validating the algorithm; we will also use the Reuters RCV1 dataset for this purpose. Finally, we will study the theoretical properties of the proposed algorithm in much more detail.

References

[1] Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries, DL '00, pages 85-94, New York, NY, USA, 2000. ACM.

[2] Massih-Reza Amini, Nicolas Usunier, and Cyril Goutte. Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems 22 (NIPS 2009), pages 28-36, 2009.

[3] Nguyen Bach and Sameer Badaskar. A review of relation extraction. 2007.

[4] Sergey Brin. Extracting patterns and relations from the world wide web. In Selected Papers from the International Workshop on The World Wide Web and Databases, WebDB '98, pages 172-183, London, UK, 1999. Springer-Verlag.

[5] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open information extraction from the web. Communications of the ACM, 51(12):68-74, December 2008.

[6] Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of the 13th International Conference on World Wide Web, WWW '04, pages 100-110, New York, NY, USA, 2004. ACM.

[7] Peter Kluegl, Martin Atzmueller, and Frank Puppe. Integrating the rule-based IE component TextMarker into UIMA. In Joachim Baumeister and Martin Atzmueller, editors, LWA-2008 (Special Track on Information Retrieval), pages 73-77, 2008.

[8] Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. Regular expression learning for information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 21-30, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.

[9] Ryan McDonald. Extracting relations from unstructured text. Technical report, 2005.

[10] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233-272, February 1999.

[11] Roman Yangarber, Ralph Grishman, and Pasi Tapanainen. Automatic acquisition of domain knowledge for information extraction. In Proceedings of the 18th International Conference on Computational Linguistics, pages 940-946, 2000.

[12] Alexander Yates, Michele Banko, Matthew Broadhead, Michael J. Cafarella, Oren Etzioni, and Stephen Soderland. TextRunner: open information extraction on the web. In HLT-NAACL (Demonstrations) '07, pages 25-26, 2007.