Sentence-Based Active Learning Strategies for Information Extraction

Andrea Esuli, Diego Marcheggiani* and Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
firstname.lastname@isti.cnr.it

* Corresponding author

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

ABSTRACT

Given a classifier trained on relatively few training examples, active learning (AL) consists in ranking a set of unlabeled examples in terms of how informative they would be, if manually labeled, for retraining a (hopefully) better classifier. An important text learning task in which AL is potentially useful is information extraction (IE), namely, the task of identifying within a text the expressions that instantiate a given concept. We contend that IE is unique among text learning tasks in that it does not make sense to rank individual items (i.e., word occurrences) for annotation, and that the minimal unit of text presented to the annotator should be an entire sentence. In this paper we propose a range of active learning strategies for IE that are based on ranking individual sentences, and experimentally compare them on a standard dataset for named entity extraction.

Keywords

Information extraction, named entity recognition, active learning, selective sampling

1. INTRODUCTION

In many applicative contexts involving supervised learning, labeled data may be scarce or expensive to obtain, while unlabeled data, even sampled from the same distribution, may abound. In such situations it may be useful to employ an algorithm that ranks the unlabeled examples and asks a human annotator to label a few of them, starting from the top-ranked ones, so as to provide additional highly informative training data. The task of this algorithm is thus to rank the unlabeled examples in terms of how informative they would be, once labeled, for the supervised learning task. The discipline that studies these algorithms is called (pool-based) active learning (aka selective sampling). This paper focuses on the application of active learning to information extraction (IE), the task of annotating sequences of one or more words (aka tokens) in a text by means of tags representing concepts of interest. The hypothetically perfect IE system is thus the one for which, for each tag in the tagset of interest, the predicted sequences of tokens coincide with the true sequences.

In text classification and other text learning tasks different from IE, the units of ranking and the units of annotation are the same; e.g., in text classification, it is the texts themselves that are ranked, and it is the texts themselves that are then annotated in their entirety by the human annotator. IE is peculiar from this standpoint since, while the units of annotation are the tokens, it does not make sense to rank individual tokens: if this were to happen, an annotator would be presented with “tokens in context” (i.e., a token in the fixed-size window of text in which the token occurs) and asked to annotate the token, with the consequence that she might be asked to read the same context several times when annotating neighbouring tokens.

In this paper we take the view that the optimal unit of ranking is the sentence. This means that all the sentences of the automatically annotated texts are ranked and presented to the annotator, who then annotates all the tokens of a few sentences, starting from the top-ranked ones. This is different from several other works in the field [6, 8, 12], in which the unit of ranking is a portion of text smaller than a sentence, i.e., a predicted sequence embedded in a fixed-size text window a few words long. The problem with the latter approach is that, by focusing on predicted sequences, the classification mistakes that the annotator corrects are the false positives, while the false negatives are never brought to light. This results in an imbalanced training set being fed to the learner.
We deem the sentence to be the optimal unit of ranking for additional reasons:

• An entire sentence offers more context for actually interpreting the tokens and the sequences within it than the fixed-size window often used in the literature. This is especially important in complex IE tasks such as opinion extraction (see e.g., [2, 5]), in which, given the variety of devices that language has for conveying opinions, and given the uncertain boundary between fact and opinion, the annotator needs to take very subtle decisions.

• Different sentences never overlap, while different fixed-length windows may do. The sentence-based approach thus results in smaller annotation effort, since the same token is never examined twice by the annotator.

• From a semantic point of view, sentences are fairly self-contained units. This means that using portions of text larger than sentences (e.g., paragraphs) as ranking units is unnecessary, also given that it is hardly the case that an annotation crosses the boundary between two consecutive sentences. Conversely, with a fixed-size window centered around a predicted sequence, another true sequence may cross the boundary between the window and its neighbouring text.

In the past, typical strategies adopted in AL for generic learning tasks have relied on ranking objects based either on the classification score attributed by the classifier to the object (relevance sampling), or on the confidence score with which the classifier has classified it (uncertainty sampling) [9]. In IE, if we want to rank entire sentences we have to come to terms with the fact that each token in the sentence has obtained a classification and a confidence score for each tag in the previous classification round, and we thus have to generate a sentence-specific score out of the token- and tag-specific scores, for all the tokens contained in the sentence and all the tags in the tagset.

The main contribution of this paper consists in proposing several alternative strategies for combining the token- and tag-specific scores into a sentence-specific score, and in comparing these strategies experimentally. We remark that this paper does not deal with active learning algorithms for specific supervised learning devices (such as e.g., [13] for text classification), but presents active learning strategies that are independent of the learning device and that are thus in principle suitable for use with any such device.

The rest of the paper is organized as follows. Our strategies for performing AL in IE are described in Section 2. In Section 3 we describe our experiments and the experimental protocol we have followed. We conclude in Section 4 by pointing out avenues for future work.

2. ACTIVE LEARNING STRATEGIES FOR INFORMATION EXTRACTION

2.1 Preliminaries: Information Extraction

This paper focuses on the application of active learning to (single-tag) information extraction (STIE, or simply IE). Let a text T consist of a sequence T = {t_1 ≺ s_1 ≺ . . . ≺ s_{n−1} ≺ t_n} of tokens (i.e., word occurrences) and separators (i.e., sequences of blanks and punctuation symbols), where “≺” means “precedes in the text”. Let C = {c_1, . . . , c_m} be a predefined set of tags (aka labels, or classes), and let c∅ ∉ C be a special tag (to be read as “no tag”). We define (single-tag) information extraction as the task of estimating an unknown target function Φ : T → C ∪ {c∅} that specifies the true tag in C ∪ {c∅} attached to each token t_i ∈ T and to each separator s_i ∈ T. The result Φ̂ : T → C ∪ {c∅} of this estimation is called the tagger (or wrapper, or classifier); consistently with most mathematical literature, we use the caret symbol (ˆ) to indicate estimation. A further property of both Φ and Φ̂ is that they can attribute a tag c_j to a separator s_i only if they also attribute the same tag to both t_i and t_{i+1}, i.e., to the tokens surrounding the separator.

In most IE tasks it is actually the case that, rather than isolated tokens and separators, sequences of consecutive tokens and separators are annotated with a given tag; e.g., the sequence “George W. Bush”, containing three tokens and two separators, might be annotated with the PER (“person name”) tag. Such sequences of tokens will here be referred to as annotated sequences (ASs); the expressions true AS and predicted AS will refer to ASs according to Φ and Φ̂, respectively. Note that the reason for considering separators as objects of tagging too is that the IE system should correctly identify sequence boundaries. For instance, given the expression “Barack Obama, Hillary Clinton and Joe Biden”, the perfect IE system will attribute the PER tag, among others, to the tokens “Barack”, “Obama”, “Hillary”, “Clinton”, and to the separators (in this case: blank spaces) between “Barack” and “Obama” and between “Hillary” and “Clinton”, but not to the separator “, ” between “Obama” and “Hillary”. If the IE system does so, this means that it has correctly identified the boundaries of the sequences “Barack Obama” and “Hillary Clinton”.

Note that “single-tag” IE means that each token (resp., separator) has exactly one tag. This is different from multi-tag IE, in which it is assumed that a given token (resp., separator) may have more than one tag (opinion extraction – see e.g., [5] – is an instance of multi-tag IE).
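To make the boundary-identification requirement concrete, here is a minimal sketch (ours, not code from the paper) that reads annotated sequences off a single-tag labeling in which every token and every separator carries one tag from C ∪ {c∅}; the item and tag lists, and the use of "O" for c∅, are illustrative assumptions.

```python
# Minimal sketch: recovering annotated sequences (ASs) from a labeling that assigns
# one tag to every token and every separator, as in the example above. "O" stands
# for the "no tag" label c∅; a separator joins two tokens into the same AS only if
# it carries the same tag as both of them, which is exactly how boundaries emerge.

def extract_sequences(items, tags):
    """items alternates tokens and separators, in text order; tags gives one tag
    per item. An AS for tag c is a maximal run of consecutive items tagged c."""
    sequences, start = [], None
    for i, tag in enumerate(tags + ["O"]):           # sentinel "O" closes the last run
        if start is None and tag != "O":
            start = i                                # a new sequence begins here
        elif start is not None and (i == len(tags) or tag != tags[start]):
            sequences.append(("".join(items[start:i]), tags[start]))
            start = i if i < len(tags) and tag != "O" else None
    return sequences

items = ["Barack", " ", "Obama", ", ", "Hillary", " ", "Clinton"]
tags  = ["PER",    "PER", "PER", "O",  "PER",     "PER", "PER"]
print(extract_sequences(items, tags))
# -> [('Barack Obama', 'PER'), ('Hillary Clinton', 'PER')]
```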
2.2 Sentence-Based AL strategies for IE

Our experimental work is focused on comparing a range of active learning strategies for IE that are based on ranking individual sentences. This section describes the strategies and the intuitions supporting them.

In this work we test two alternative learning devices, support vector machines (SVMs) (see e.g., [1]) and conditional random fields (CRFs) [7]. For SVMs we have adopted a widely used method for realizing a multiclass classifier as a combination of binary classifiers, i.e., the one-versus-all method. The one-versus-all method consists in learning m binary classifiers Φ̂_c : T → ℝ, each one trained using as positive examples all the tokens in the training set Tr that are labeled with c, and as negative examples all the other tokens, regardless of their original label. The multiclass classifier is then defined as Φ̂(t) = arg max_{c ∈ C ∪ {c∅}} Φ̂_c(t), i.e., the assigned label is the one whose binary classifier scored the maximum confidence. CRFs are a discriminative probabilistic learning method based on an undirected graph model, and are frequently used for labeling sequential data, e.g., the sequence of words composing a text. Given a token t, a CRF classifier estimates the likelihood Φ̂_c(t) = P(c|t) for each c ∈ C ∪ {c∅} and, similarly to SVMs, the assigned label is the one scoring the highest Φ̂_c(t) value. CRFs are nowadays considered the state-of-the-art learning device for information extraction [11].

The strategies we propose are based on two concepts, label score and tag score. The label score of a token is ls(t) = max_{c ∈ C ∪ {c∅}} Φ̂_c(t), i.e., the maximum confidence score, which determines the decision taken by the classifier Φ̂(t). The tag score is instead defined as ts(t) = max_{c ∈ C} Φ̂_c(t), i.e., the maximum confidence with which the classifier considers the token as belonging to one of the tags in C, regardless of the confidence with respect to c∅.
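As a concrete illustration, the following minimal sketch (ours; the names and the dictionary representation are assumptions) computes the one-versus-all decision, the label score ls(t) and the tag score ts(t) from the per-tag confidence scores Φ̂_c(t) that either learner returns for a token.

```python
# Sketch of the per-token quantities defined above, assuming the learner exposes,
# for each token, a confidence score for every tag in C ∪ {c∅}; the "no tag" label
# c∅ is represented here as "O".

NO_TAG = "O"

def predicted_tag(phi):
    """Φ̂(t) = arg max over C ∪ {c∅} of Φ̂_c(t)."""
    return max(phi, key=phi.get)

def label_score(phi):
    """ls(t): the confidence behind the decision actually taken, c∅ included."""
    return max(phi.values())

def tag_score(phi):
    """ts(t): the highest confidence over the 'real' tags only, ignoring c∅."""
    return max(score for c, score in phi.items() if c != NO_TAG)

phi = {"PER": 0.15, "LOC": 0.05, "ORG": 0.10, "MISC": 0.02, "O": 0.68}
print(predicted_tag(phi), label_score(phi), tag_score(phi))
# -> O 0.68 0.15   (the token is predicted as untagged, but PER is the runner-up)
```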
2.2.1 Tag score-based strategies

The following strategies are based on combining the tag scores assigned to the tokens in the sentence, following the intuition that the elements on which the classifier has low confidence could be more useful to the learner, so as to gather knowledge on “difficult” cases.

The Min Min Confidence (MMC) strategy assigns to the sentence a value equal to the minimum tag score among the tokens composing it, i.e.,

   MMC(s) = min_{t ∈ s} ts(t)   (1)

Sentence ranking is performed in increasing order of MMC(s) value.

Min Average Confidence (MAC) is a version of MMC that tries instead to be robust to single “extreme” evaluations, averaging the tag scores of all the tokens composing the sentence, i.e.,

   MAC(s) = avg_{t ∈ s} ts(t)   (2)

2.2.2 Label score-based strategies

Symmetrically to the tag score-based strategies, the label score-based strategies follow the somewhat different intuition that the elements on which the classifier has high confidence could be useful, so that the strong beliefs of the learner are confirmed when correct, or corrected when a blatant error is found.

The Max Max Score (MMS) strategy assigns to each sentence a value equal to the highest label score among the tokens composing it, i.e.,

   MMS(s) = max_{t ∈ s} ls(t)   (3)

Sentence ranking is performed in decreasing order of MMS(s) value.

Similarly to MAC, Max Average Score (MAS) instead averages the label scores of all the tokens composing the sentence, i.e.,

   MAS(s) = avg_{t ∈ s} ls(t)   (4)

2.2.3 Tag count-based strategies

The following strategies are instead based on counting the number of tokens that are given a tag different from c∅ by the classifier.

The Max Tag Count (MTC) strategy counts the number of tokens in the sentence that are given a tag different from c∅, i.e.,

   MTC(s) = |{t ∈ s | Φ̂(t) ∈ C}|   (5)

Sentence ranking is performed in decreasing order of MTC(s) value.

Since MTC naturally favours long sentences, we have also tested a strategy (Max Tag Ratio – MTR) that normalizes the values by sentence length, i.e.,

   MTR(s) = |{t ∈ s | Φ̂(t) ∈ C}| / |s|   (6)

The Medium Tag Ratio (MedTR) strategy instead top-ranks the sentences whose tag ratio is closest to the average tag ratio measured on the training set, i.e.,

   MedTR(s) = MTR(s) / avg_{s′ ∈ Tr} MTR(s′)   (7)
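Once the per-token scores are available, the sentence scores of Eqs. (1)–(7) reduce to a few lines each. The sketch below (ours) reuses the predicted_tag, label_score and tag_score helpers and the "O" convention from the previous sketch; a sentence is represented as a list of per-token score dictionaries, and avg_train_tag_ratio is an assumed quantity measured on the current training set.

```python
# Illustrative implementations (ours) of the sentence scores of Eqs. (1)-(7),
# reusing predicted_tag, label_score, tag_score and NO_TAG from the sketch above.

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def mmc(sentence):                 # Eq. (1); rank in increasing order of the score
    return min(tag_score(phi) for phi in sentence)

def mac(sentence):                 # Eq. (2); rank in increasing order of the score
    return avg(tag_score(phi) for phi in sentence)

def mms(sentence):                 # Eq. (3); rank in decreasing order of the score
    return max(label_score(phi) for phi in sentence)

def mas(sentence):                 # Eq. (4); rank in decreasing order of the score
    return avg(label_score(phi) for phi in sentence)

def mtc(sentence):                 # Eq. (5); number of tokens predicted as tagged
    return sum(1 for phi in sentence if predicted_tag(phi) != NO_TAG)

def mtr(sentence):                 # Eq. (6); tag count normalized by sentence length
    return mtc(sentence) / len(sentence)

def medtr(sentence, avg_train_tag_ratio):       # Eq. (7); values close to 1 indicate a
    return mtr(sentence) / avg_train_tag_ratio  # tag ratio close to the training-set average

# Example ranking for MAC: sentences with the lowest average tag score come first.
# ranked = sorted(unlabeled_sentences, key=mac)
```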
2.2.4 Round Robin-based strategies

While the previous strategies always combine the confidence values returned on the various tag types, the following strategies are based on computing values separately for each tag, then selecting the most informative sentences using a “round robin” selection process across all the tags.

The Round Robin Max Score (RRMS) strategy assigns, for each c ∈ C, a relevance score to the sentence equal to the maximum score obtained by the tokens contained in it, i.e.,

   RRMS_c(s) = max_{t ∈ s} Φ̂_c(t)   (8)

Then a round robin selection process is performed on the |C| rankings produced.

Similarly to MAS, Round Robin Average Score (RRAS) uses averaging instead of maximization, i.e.,

   RRAS_c(s) = avg_{t ∈ s} Φ̂_c(t)   (9)

The Round Robin Max Tag Ratio (RRMTR) strategy applies instead the MTR strategy considering the various tags separately from each other, so as to avoid favouring the most frequent tags over the most infrequent.
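The paper does not give pseudocode for the round robin selection; the sketch below is one plausible reading of it, under the assumption that each tag produces its own ranking of the unlabeled sentences (here by RRMS, Eq. (8)) and that the tags then take turns picking their best not-yet-selected sentence.

```python
# One possible reading (ours) of the round robin selection over the |C| per-tag
# rankings: tags take turns picking their best not-yet-selected sentence.

def rrms_per_tag(sentence, tag):                      # Eq. (8)
    return max(phi[tag] for phi in sentence)

def round_robin_select(sentences, tags, how_many, score=rrms_per_tag):
    # One ranking of sentence indices per tag, best-scoring sentences first.
    rankings = {c: sorted(range(len(sentences)),
                          key=lambda i: score(sentences[i], c),
                          reverse=True)
                for c in tags}
    cursors = {c: 0 for c in tags}                    # next candidate per tag
    selected, seen = [], set()
    while len(selected) < how_many and len(seen) < len(sentences):
        for c in tags:                                # tags take turns
            while cursors[c] < len(sentences) and rankings[c][cursors[c]] in seen:
                cursors[c] += 1                       # skip already-selected sentences
            if cursors[c] < len(sentences):
                i = rankings[c][cursors[c]]
                seen.add(i)
                selected.append(i)
                if len(selected) == how_many:
                    return selected
    return selected                                   # indices into `sentences`
```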
3. EXPERIMENTS

3.1 Experimental setting

The dataset we have used for evaluating our strategies is the CoNLL 2003 named entity extraction dataset. The dataset consists of 1,393 Reuters newswire articles, for a total of 301,418 tokens. The tagset consists of 4 tags (LOC, PER, ORG, MISC, standing for “location”, “person”, “organization”, and “miscellaneous”, respectively) plus the special tag O, which tags any token / separator not tagged by any tag in {LOC, PER, ORG, MISC}. The tokens inside the corpus are tagged as follows: 10,645 tokens are tagged as LOC, 9,323 as ORG, 10,059 as PER, 5,062 as MISC, while the remaining 266,329 are tagged as O. We used a version of the CoNLL corpus already preprocessed with Pianta and Zanoli's Tagpro system [10], a PoS-tagging system based on YamCha that computes features such as prefixes, suffixes, orthographic information (e.g., capitalization, hyphenation) and morphological features, as well as PoS tags and chunk tags. These features altogether form the vectorial representations of tokens and separators that are fed to the learning device.

For the latter, we have tested two alternative, off-the-shelf packages, i.e., YamCha (http://www.chasen.org/~taku/software/YamCha/) and CRF++ (http://crfpp.sourceforge.net/), respectively based on support vector machines and conditional random fields.

We evaluate the results of our experiments using the F1 measure on a token & separator evaluation model [3]. The token & separator model considers each token and each separator as an object of tagging; for instance, given tag c, the TP (“true positives”) entry of the contingency table for c consists in the number of tokens that are correctly assigned tag c plus the number of separators that are correctly assigned tag c. Once the contingency tables for all the tags in C have been filled, the evaluation is done using standard micro-averaged and macro-averaged F1 (denoted F1µ and F1M, respectively).
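As an illustration of the token & separator model, the sketch below (ours; the exact bookkeeping of [3] may differ in detail) fills one contingency table per tag in C and computes micro- and macro-averaged F1 from them.

```python
# Sketch of token & separator evaluation: every token and every separator is an
# object of tagging; one contingency table per tag in C (the special tag "O" is
# not itself evaluated), then micro- and macro-averaged F1 over those tables.

from collections import defaultdict

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evaluate(true_tags, predicted_tags, tagset):
    """true_tags / predicted_tags: one tag per token-or-separator, in text order;
    tagset: the tags in C, e.g. {"LOC", "PER", "ORG", "MISC"}."""
    tables = {c: defaultdict(int) for c in tagset}
    for true, pred in zip(true_tags, predicted_tags):
        for c in tagset:
            if pred == c and true == c:
                tables[c]["TP"] += 1
            elif pred == c:
                tables[c]["FP"] += 1
            elif true == c:
                tables[c]["FN"] += 1
    micro = f1(sum(t["TP"] for t in tables.values()),
               sum(t["FP"] for t in tables.values()),
               sum(t["FN"] for t in tables.values()))
    macro = sum(f1(t["TP"], t["FP"], t["FN"]) for t in tables.values()) / len(tagset)
    return micro, macro
```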
3.2 Experimental protocol

In this work we adopt the following iterative experimental protocol. The protocol has three integer parameters α, β, and γ. Let Ω be a set of natural language sentences partitioned into a training set Tr and a test set Te, and let σ be an active learning strategy:

1. Set an iteration counter t = 0;

2. Set the current training set Tr_t to the set of the first α sentences of Tr; set the current “unlabeled set” U_t ← Tr / Tr_t;

3. For t = 1, . . . , β repeat the following steps:

   (a) Generate a classifier Φ̂_t from the current training set Tr_t;
   (b) Evaluate the effectiveness of Φ̂_t on Te;
   (c) Classify U_t by means of Φ̂_t;
   (d) Rank U_t according to strategy σ, thus generating the ranking σ(U_t);
   (e) Let r(U_t, γ) be the smallest prefix of σ(U_t) (i.e., the smallest number of top-ranked elements of σ(U_t)) that contains at least γ tokens; set Tr_{t+1} ← Tr_t ∪ r(U_t, γ); set U_{t+1} ← U_t / r(U_t, γ).

It is important to remark that Step 3b has only the purpose of collecting the results for experimental purposes (i.e., for producing the tables of Section 3.3); since it uses the test set Te, its results should obviously not be (and are not) accessible to the algorithm.

The above protocol simulates the activity of a human annotator who, at the beginning of the process, has available a training set Tr_0 consisting of α manually tagged sentences, and an “unlabeled set” U_0 consisting of |Tr| − α untagged sentences. The annotator generates a classifier Φ̂_0 from Tr_0, uses it to tag the sentences in U_0, asks the active learning agent to rank them, manually labels the top-ranked ones for a total of roughly γ tokens, generates a new classifier Φ̂_1 from an augmented training set that comprises Tr_0 and the newly tagged sentences, and repeats this process β times.

In our experiments we have set α = 110 (in the CoNLL 2003 dataset this corresponds to approximately 2,000 tokens), β = 20, and γ = 200; this means that each strategy is evaluated by testing the accuracy of the classifiers generated from training sets consisting of approximately 2000, 2200, . . . , 5800, 6000 training tokens, for a total of 20 experiments per strategy. We think these parameters are realistic, since they simulate a situation in which:

• there are only about 100 manually tagged sentences at the beginning (this is reasonable since, in applications in which significantly more training data are already available, human annotators might not find it worthwhile to annotate any further);

• every time the human annotator manually labels 200 unlabeled tokens, he/she wants to retrain the system (this is reasonable, since he/she wants to operate on a ranking of the unlabeled sentences that incorporates as much as possible the feedback he/she has already given to the system);

• the human annotator does not want to do any further manual labeling once about 6,000 training tokens are available (this seems reasonable, since at this point the cost-effectiveness of the manual effort has probably decreased significantly).

As the baseline strategy for the evaluation of our results we adopt the one that consists in adding further labeled sentences to the training set by picking them at random. This simulates the behaviour of a human annotator who picks unlabeled sentences and labels them in no particular order.
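The protocol can be summarized in a few lines of code. In the sketch below (ours), train, evaluate and strategy are placeholder callables: in the actual experiments they would wrap YamCha or CRF++ and one of the strategies of Section 2.2, so they are assumed interfaces rather than real APIs of those packages; the strategy is assumed to return a score in which lower means more informative.

```python
# Compact sketch of the iterative protocol: alpha seed sentences, beta rounds of
# active learning, roughly gamma newly annotated tokens per round. Sentences are
# lists of tokens; train/evaluate/strategy are placeholders, not real package APIs.

def active_learning_run(train_sentences, test_sentences, strategy, train, evaluate,
                        alpha=110, beta=20, gamma=200):
    labeled = list(train_sentences[:alpha])       # Tr_0: the first alpha sentences
    unlabeled = list(train_sentences[alpha:])     # U_0
    results = []
    for t in range(beta + 1):
        classifier = train(labeled)                           # step (a)
        results.append(evaluate(classifier, test_sentences))  # step (b), test set only
        if t == beta or not unlabeled:
            break
        ranking = sorted(unlabeled,                           # steps (c) + (d)
                         key=lambda s: strategy(classifier, s))
        batch, n_tokens = [], 0
        while ranking and n_tokens < gamma:                   # step (e): smallest prefix
            s = ranking.pop(0)                                # with at least gamma tokens
            batch.append(s)
            n_tokens += len(s)
        labeled += batch                                      # Tr_{t+1}
        unlabeled = ranking                                   # U_{t+1}
    return results
```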
3.3 Results

The main results of our experiments are summarized in Table 1. This table reports, for each individual strategy, the values of F1µ and F1M obtained after 20 training sessions resulting from the protocol of Section 3.2, with α = 110, β = 20, and γ = 200, using the two different learners, SVMs and CRFs.

Table 1: Values of F1 obtained after the last training session, i.e., with classifiers trained on approximately 2,000 training tokens plus approximately 4,000 tokens manually annotated as a result of the active learning strategy. The best performance in each row is marked with an asterisk.

              base   MAC    MAS    MMC    MMS    RRAS   RRMS   MTR    RRMTR  MTC    MedTR
F1µ  YamCha   .650   .683*  .583   .645   .530   .639   .596   .628   .607   .632   .555
F1µ  CRF++    .656   .697*  .610   .654   .463   .643   .573   .622   .509   .626   .544
F1M  YamCha   .634   .661*  .525   .633   .526   .644   .577   .593   .597   .613   .538
F1M  CRF++    .639   .664*  .546   .634   .473   .633   .564   .563   .519   .606   .517

Quite surprisingly, the only genuine strategy that outperforms the random baseline is the MAC strategy. The relative improvement of MAC over the random baseline (“base” in Table 1) ranges from 3.9% up to 6.3%. This improvement matches our expectations, given the close relation between the MAC strategy and the uncertainty sampling method [9], which had already proved to be effective for AL.

All the other strategies, instead, perform worse than, or no better than, the random baseline. In order to understand the possible motivations behind these results we have inspected the sentences selected by the various strategies at the various iterations. This inspection allowed us to draw some specific conclusions on some of the strategies, and a general observation on the entire pool of strategies.

The MTR and RRMTR strategies tend to select very short sentences (two or three words) composed just of named entities. This allows gathering a lot of different instances of named entities, but without a context of use, which is important in order to learn how to perform extraction from longer, more articulated sentences.

The MTC strategy selects sentences of variable length, but tends to over-select sentences full of named entities, thus with a very limited number of O-tagged tokens.

A common aspect of all the strategies is that the more similar two sentences are, the more similar are the scores that the various strategies assign to them. If the dataset contains many similar sentences, and such sentences obtain high scores, the contribution of relevant information to the training set is limited, because of the redundancy contained in the set of sentences selected.

A comparison of the round robin-based strategies (RRAS, RRMS, RRMTR) against their respective “single-rank” versions (MAS, MMS, MTR) shows that the RR strategies produce an improvement in the F1M measure, as should be expected when using a class-balancing method such as RR.

The comparison of the averaging-based strategies (MAC, MAS, RRAS) against their respective versions based on maximization / minimization (MMC, MMS, RRMS) shows that averaging always performs better than maximization / minimization. This indicates that the smoothing introduced by averaging helps the strategies to filter out the single “false-relevant” tokens that may appear in otherwise non-relevant sentences.

4. CONCLUSIONS

We have argued that, in active learning for information extraction, the sentence should be the unit of ranking. We have thus studied several strategies for scoring a given sentence for ranking purposes, based on the classification score and the confidence score obtained by each token in the sentence. On the positive side, the experimental results that we have obtained by testing these strategies on a named entity extraction task show one such strategy (Min Average Confidence) to outperform the others, irrespective of the learning device used (support vector machines or conditional random fields) and of the evaluation measure used (micro-averaged or macro-averaged F1). On the negative side, the same results show that all the other strategies, though seemingly based on solid intuitions, tend to be roughly equivalent to a random strategy. In the future we plan to test these strategies further, possibly on IE tasks more difficult than named entity extraction, such as opinion extraction.

Acknowledgments

We thank Emanuele Pianta and Roberto Zanoli for kindly providing us with a version of the CoNLL corpus already preprocessed with their Tagpro system [10].

5. REFERENCES

[1] Christopher J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[2] Yejin Choi, Eric Breck, and Claire Cardie. Joint extraction of entities and relations for opinion recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP'06), Sydney, AU, 2006.

[3] Andrea Esuli, Michal Pryczek, and Fabrizio Sebastiani. Evaluating information extraction systems. Technical report, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, IT, 2010. Forthcoming.

[4] Andrea Esuli and Fabrizio Sebastiani. Active learning strategies for multi-label text classification. In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09), pages 102–113, Toulouse, FR, 2009.

[5] Andrea Esuli and Fabrizio Sebastiani. Enhancing opinion extraction by automatically annotated lexical resources. In Proceedings of the 4th Language Technology Conference (LTC'09), pages 224–228, Poznań, PL, 2009.

[6] Rosie Jones, Rayid Ghani, Tom Mitchell, and Ellen Riloff. Active learning for information extraction with multiple view feature sets. In Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM'03), pages 18–25, Cavtat–Dubrovnik, HR, 2003.

[7] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML'01), pages 282–289, Williamstown, US, 2001.

[8] Florian Laws and Hinrich Schütze. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08), pages 465–472, Manchester, UK, 2008.

[9] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94), pages 3–12, Dublin, IE, 1994.

[10] Emanuele Pianta and Roberto Zanoli. Tagpro: A system for Italian POS tagging based on SVMs. Intelligenza Artificiale, 4(2):8–9, 2007.

[11] Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.

[12] Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 589–596, Barcelona, ES, 2004.

[13] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, 2001.