Sentence-Based Active Learning Strategies for Information Extraction

Andrea Esuli, Diego Marcheggiani* and Fabrizio Sebastiani
Istituto di Scienza e Tecnologie dell'Informazione
Consiglio Nazionale delle Ricerche
Via Giuseppe Moruzzi, 1 – 56124 Pisa, Italy
firstname.lastname@isti.cnr.it

* Corresponding author

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

ABSTRACT

Given a classifier trained on relatively few training examples, active learning (AL) consists in ranking a set of unlabeled examples in terms of how informative they would be, if manually labeled, for retraining a (hopefully) better classifier. An important text learning task in which AL is potentially useful is information extraction (IE), namely, the task of identifying within a text the expressions that instantiate a given concept. We contend that IE is unique among text learning tasks in that it does not make sense to rank individual items (i.e., word occurrences) for annotation, and that the minimal unit of text presented to the annotator should be an entire sentence. In this paper we propose a range of active learning strategies for IE that are based on ranking individual sentences, and experimentally compare them on a standard dataset for named entity extraction.

Keywords

Information extraction, named entity recognition, active learning, selective sampling

1. INTRODUCTION

In many applicative contexts involving supervised learning, labeled data may be scarce or expensive to obtain, while unlabeled data, even sampled from the same distribution, may abound. In such situations it may be useful to employ an algorithm that ranks the unlabeled examples and asks a human annotator to label a few of them, starting from the top-ranked ones, so as to provide additional highly informative training data. The task of this algorithm is thus to rank the unlabeled examples in terms of how informative they would be, once labeled, for the supervised learning task. The discipline that studies these algorithms is called (pool-based) active learning (aka selective sampling). This paper focuses on the application of active learning to information extraction (IE), the task of annotating sequences of one or more words (aka tokens) in a text by means of tags representing concepts of interest. The hypothetically perfect IE system is thus the one for which, for each tag in the tagset of interest, the predicted sequences of tokens coincide with the true sequences.

In text classification and other text learning tasks different from IE, the units of ranking and the units of annotation are the same; e.g., in text classification, it is the texts themselves that are ranked, and it is the texts themselves that are then annotated in their entirety by the human annotator. IE is peculiar from this standpoint since, while the units of annotation are the tokens, it does not make sense to rank individual tokens: if this were to happen, an annotator would be presented with “tokens in context” (i.e., a token in the fixed-size window of text in which the token occurs) and asked to annotate the token, with the consequence that she might be asked to read the same context several times when annotating neighbouring tokens.

In this paper we take the view that the optimal unit of ranking is the sentence. This means that all the sentences of the automatically annotated texts are ranked and presented to the annotator, who then annotates all the tokens of a few sentences, starting from the top-ranked ones. This is different from several other works in the field [6, 8, 12], in which the unit of ranking is a portion of text smaller than a sentence, i.e., a predicted sequence embedded in a fixed-size text window a few words long. The problem with the latter approach is that, by focusing on predicted sequences, the classification mistakes that the annotator corrects are the false positives, while the false negatives are never brought to light. This results in an imbalanced training set being fed to the learner.
We deem the sentence to be the optimal unit of ranking for additional reasons:

• An entire sentence offers more context for actually interpreting the tokens and the sequences within it than the fixed-size window often used in the literature. This is especially important in complex IE tasks such as opinion extraction (see e.g., [2, 5]), in which, given the variety of devices that language has for conveying opinions, and given the uncertain boundary between fact and opinion, the annotator needs to take very subtle decisions.

• Different sentences never overlap, while different fixed-length windows may do. The sentence-based approach thus results in smaller annotation effort, since the same token is never examined twice by the annotator.

• From a semantic point of view, sentences are fairly self-contained units. This means that using portions of text larger than sentences (e.g., paragraphs) as ranking units is unnecessary, also given that it is hardly the case that an annotation crosses the boundary between two consecutive sentences. Conversely, with a fixed-size window centered around a predicted sequence, another true sequence may cross the boundary between the window and its neighbouring text.

In the past, typical strategies adopted in AL for generic learning tasks have relied on ranking objects based either on the classification score attributed by the classifier to the object (relevance sampling), or on the confidence score with which the classifier has classified it (uncertainty sampling) [9]. In IE, if we want to rank entire sentences we have to come to terms with the fact that each token in the sentence has obtained a classification and a confidence score for each tag in the previous classification round, and we thus have to generate a sentence-specific score out of the token- and tag-specific scores, for all the tokens contained in the sentence and all the tags in the tagset.

The main contribution of this paper consists in proposing several alternative strategies for combining the token- and tag-specific scores into a sentence-specific score, and in comparing these strategies experimentally. We remark that this paper does not deal with active learning algorithms for specific supervised learning devices (such as e.g., [13] for text classification), but presents active learning strategies that are independent of the learning device and that are thus in principle suitable for use with any such device.

The rest of the paper is organized as follows. Our strategies for performing AL in IE are described in Section 2. In Section 3 we describe our experiments and the experimental protocol we have followed. We conclude in Section 4 by pointing out avenues for future work.

2. ACTIVE LEARNING STRATEGIES FOR INFORMATION EXTRACTION

2.1 Preliminaries: Information Extraction

This paper focuses on the application of active learning to (single-tag) information extraction (STIE, or simply IE). Let a text T consist of a sequence T = {t_1 ≺ s_1 ≺ . . . ≺ s_{n−1} ≺ t_n} of tokens (i.e., word occurrences) and separators (i.e., sequences of blanks and punctuation symbols), where “≺” means “precedes in the text”. Let C = {c_1, . . . , c_m} be a predefined set of tags (aka labels, or classes), and let c∅ ∉ C be a special tag (to be read as “no tag”). We define (single-tag) information extraction as the task of estimating an unknown target function Φ : T → C ∪ {c∅} that specifies the true tag in C ∪ {c∅} attached to each token t_i ∈ T and to each separator s_i ∈ T. The result Φ̂ : T → C ∪ {c∅} of this estimation is called the tagger (or wrapper, or classifier); consistently with most mathematical literature, we use the caret symbol (ˆ) to indicate estimation. A further property of both Φ and Φ̂ is that they can attribute a tag c_j to a separator s_i only if they also attribute the same tag to both t_i and t_{i+1}, i.e., to the tokens surrounding the separator.

In most IE tasks it is actually the case that, rather than isolated tokens and separators, sequences of consecutive tokens and separators are annotated with a given tag; e.g., the sequence “George W. Bush”, containing three tokens and two separators, might be annotated with the PER (“person name”) tag. Such sequences of tokens will here be referred to as annotated sequences (ASs); the expressions true AS and predicted AS will refer to ASs according to Φ and Φ̂, respectively. Note that the reason for considering separators as objects of tagging too is that the IE system should correctly identify sequence boundaries. For instance, given the expression “Barack Obama, Hillary Clinton and Joe Biden”, the perfect IE system will attribute the PER tag, among others, to the tokens “Barack”, “Obama”, “Hillary”, “Clinton”, and to the separators (in this case: blank spaces) between “Barack” and “Obama” and between “Hillary” and “Clinton”, but not to the separator “, ” between “Obama” and “Hillary”. If the IE system does so, this means that it has correctly identified the boundaries of the sequences “Barack Obama” and “Hillary Clinton”.

Note that “single-tag” IE means that each token (resp., separator) has exactly one tag. This is different from multi-tag IE, in which it is assumed that a given token (resp., separator) may have more than one tag (opinion extraction – see e.g., [5] – is an instance of multi-tag IE).
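To make the boundary-identification requirement concrete, here is a minimal sketch (ours, not code from the paper) that reads annotated sequences off a single-tag labeling in which every token and every separator carries one tag from C ∪ {c∅}; the item and tag lists, and the use of "O" for c∅, are illustrative assumptions.

```python
# Minimal sketch: recovering annotated sequences (ASs) from a labeling that assigns
# one tag to every token and every separator, as in the example above. "O" stands
# for the "no tag" label c∅; a separator joins two tokens into the same AS only if
# it carries the same tag as both of them, which is exactly how boundaries emerge.

def extract_sequences(items, tags):
    """items alternates tokens and separators, in text order; tags gives one tag
    per item. An AS for tag c is a maximal run of consecutive items tagged c."""
    sequences, start = [], None
    for i, tag in enumerate(tags + ["O"]):           # sentinel "O" closes the last run
        if start is None and tag != "O":
            start = i                                # a new sequence begins here
        elif start is not None and (i == len(tags) or tag != tags[start]):
            sequences.append(("".join(items[start:i]), tags[start]))
            start = i if i < len(tags) and tag != "O" else None
    return sequences

items = ["Barack", " ", "Obama", ", ", "Hillary", " ", "Clinton"]
tags  = ["PER",    "PER", "PER", "O",  "PER",     "PER", "PER"]
print(extract_sequences(items, tags))
# -> [('Barack Obama', 'PER'), ('Hillary Clinton', 'PER')]
```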
2.2 Sentence-Based AL strategies for IE

Our experimental work is focused on comparing a range of active learning strategies for IE that are based on ranking individual sentences. This section describes the strategies and the intuitions supporting them.

In this work we test two alternative learning devices, support vector machines (SVMs) (see e.g., [1]) and conditional random fields (CRFs) [7]. For SVMs we have adopted a widely used method for realizing a multiclass classifier as a combination of binary classifiers, i.e., the one-versus-all method. The one-versus-all method consists in learning m binary classifiers Φ̂_c : T → ℝ, each one trained using as positive examples all the tokens in the training set Tr that are labeled with c, and as negative examples all the other tokens, regardless of their original label. The multiclass classifier is then defined as Φ̂(t) = arg max_{c ∈ C ∪ {c∅}} Φ̂_c(t), i.e., the assigned label is the one whose binary classifier scored the maximum confidence. CRFs are a discriminative probabilistic learning method based on an undirected graph model, and are frequently used for labeling sequential data, e.g., the sequence of words composing a text. Given a token t, a CRF classifier estimates the likelihood Φ̂_c(t) = P(c|t) for each c ∈ C ∪ {c∅} and, similarly to SVMs, the assigned label is the one scoring the highest Φ̂_c(t) value. CRFs are nowadays considered the state-of-the-art learning device for information extraction [11].

The strategies we propose are based on two concepts, label score and tag score. The label score of a token is ls(t) = max_{c ∈ C ∪ {c∅}} Φ̂_c(t), i.e., the maximum confidence score, which determines the decision taken by the classifier Φ̂(t). The tag score is instead defined as ts(t) = max_{c ∈ C} Φ̂_c(t), i.e., the maximum confidence with which the classifier considers the token as belonging to one of the tags in C, regardless of the confidence with respect to c∅.
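As a concrete illustration, the following minimal sketch (ours; the names and the dictionary representation are assumptions) computes the one-versus-all decision, the label score ls(t) and the tag score ts(t) from the per-tag confidence scores Φ̂_c(t) that either learner returns for a token.

```python
# Sketch of the per-token quantities defined above, assuming the learner exposes,
# for each token, a confidence score for every tag in C ∪ {c∅}; the "no tag" label
# c∅ is represented here as "O".

NO_TAG = "O"

def predicted_tag(phi):
    """Φ̂(t) = arg max over C ∪ {c∅} of Φ̂_c(t)."""
    return max(phi, key=phi.get)

def label_score(phi):
    """ls(t): the confidence behind the decision actually taken, c∅ included."""
    return max(phi.values())

def tag_score(phi):
    """ts(t): the highest confidence over the 'real' tags only, ignoring c∅."""
    return max(score for c, score in phi.items() if c != NO_TAG)

phi = {"PER": 0.15, "LOC": 0.05, "ORG": 0.10, "MISC": 0.02, "O": 0.68}
print(predicted_tag(phi), label_score(phi), tag_score(phi))
# -> O 0.68 0.15   (the token is predicted as untagged, but PER is the runner-up)
```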
2.2.1 Tag score-based strategies

The following strategies are based on combining the tag scores assigned to the tokens in the sentence, following the intuition that the elements on which the classifier has low confidence could be more useful to the learner, so as to gather knowledge on “difficult” cases.

The Min Min Confidence (MMC) strategy assigns to the sentence a value equal to the minimum tag score among the tokens composing it, i.e.,

   MMC(s) = min_{t ∈ s} ts(t)   (1)

Sentence ranking is performed in increasing order of MMC(s) value.

Min Average Confidence (MAC) is a version of MMC that tries instead to be robust to single “extreme” evaluations, averaging the tag scores of all the tokens composing the sentence, i.e.,

   MAC(s) = avg_{t ∈ s} ts(t)   (2)

2.2.2 Label score-based strategies

Symmetrically to the tag score-based strategies, the label score-based strategies follow the somewhat different intuition that the elements on which the classifier has high confidence could be useful, so that the strong beliefs of the learner are confirmed when correct, or corrected when a blatant error is found.

The Max Max Score (MMS) strategy assigns to each sentence a value equal to the highest label score among the tokens composing it, i.e.,

   MMS(s) = max_{t ∈ s} ls(t)   (3)

Sentence ranking is performed in decreasing order of MMS(s) value.

Similarly to MAC, Max Average Score (MAS) instead averages the label scores of all the tokens composing the sentence, i.e.,

   MAS(s) = avg_{t ∈ s} ls(t)   (4)

2.2.3 Tag count-based strategies

The following strategies are instead based on counting the number of tokens that are given a tag different from c∅ by the classifier.

The Max Tag Count (MTC) strategy counts the number of tokens in the sentence that are given a tag different from c∅, i.e.,

   MTC(s) = |{t ∈ s | Φ̂(t) ∈ C}|   (5)

Sentence ranking is performed in decreasing order of MTC(s) value.

Since MTC naturally favours long sentences, we have also tested a strategy (Max Tag Ratio – MTR) that normalizes the values by sentence length, i.e.,

   MTR(s) = |{t ∈ s | Φ̂(t) ∈ C}| / |s|   (6)

The Medium Tag Ratio (MedTR) strategy instead top-ranks the sentences whose tag ratio is closest to the average tag ratio measured on the training set, i.e.,

   MedTR(s) = MTR(s) / avg_{s′ ∈ Tr} MTR(s′)   (7)
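Once the per-token scores are available, the sentence scores of Eqs. (1)–(7) reduce to a few lines each. The sketch below (ours) reuses the predicted_tag, label_score and tag_score helpers and the "O" convention from the previous sketch; a sentence is represented as a list of per-token score dictionaries, and avg_train_tag_ratio is an assumed quantity measured on the current training set.

```python
# Illustrative implementations (ours) of the sentence scores of Eqs. (1)-(7),
# reusing predicted_tag, label_score, tag_score and NO_TAG from the sketch above.

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def mmc(sentence):                 # Eq. (1); rank in increasing order of the score
    return min(tag_score(phi) for phi in sentence)

def mac(sentence):                 # Eq. (2); rank in increasing order of the score
    return avg(tag_score(phi) for phi in sentence)

def mms(sentence):                 # Eq. (3); rank in decreasing order of the score
    return max(label_score(phi) for phi in sentence)

def mas(sentence):                 # Eq. (4); rank in decreasing order of the score
    return avg(label_score(phi) for phi in sentence)

def mtc(sentence):                 # Eq. (5); number of tokens predicted as tagged
    return sum(1 for phi in sentence if predicted_tag(phi) != NO_TAG)

def mtr(sentence):                 # Eq. (6); tag count normalized by sentence length
    return mtc(sentence) / len(sentence)

def medtr(sentence, avg_train_tag_ratio):       # Eq. (7); values close to 1 indicate a
    return mtr(sentence) / avg_train_tag_ratio  # tag ratio close to the training-set average

# Example ranking for MAC: sentences with the lowest average tag score come first.
# ranked = sorted(unlabeled_sentences, key=mac)
```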
2.2.4 Round Robin-based strategies

While the previous strategies always combine the confidence values returned on the various tag types, the following strategies are based on computing values separately for each tag, then selecting the most informative sentences using a “round robin” selection process across all the tags.

The Round Robin Max Score (RRMS) strategy assigns, for each c ∈ C, a relevance score to the sentence equal to the maximum score obtained by the tokens contained in it, i.e.,

   RRMS_c(s) = max_{t ∈ s} Φ̂_c(t)   (8)

Then a round robin selection process is performed on the |C| rankings produced.

Similarly to MAS, Round Robin Average Score (RRAS) uses averaging instead of maximization, i.e.,

   RRAS_c(s) = avg_{t ∈ s} Φ̂_c(t)   (9)

The Round Robin Max Tag Ratio (RRMTR) strategy applies instead the MTR strategy considering the various tags separately from each other, so as to avoid favouring the most frequent tags over the most infrequent.
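The paper does not give pseudocode for the round robin selection; the sketch below is one plausible reading of it, under the assumption that each tag produces its own ranking of the unlabeled sentences (here by RRMS, Eq. (8)) and that the tags then take turns picking their best not-yet-selected sentence.

```python
# One possible reading (ours) of the round robin selection over the |C| per-tag
# rankings: tags take turns picking their best not-yet-selected sentence.

def rrms_per_tag(sentence, tag):                      # Eq. (8)
    return max(phi[tag] for phi in sentence)

def round_robin_select(sentences, tags, how_many, score=rrms_per_tag):
    # One ranking of sentence indices per tag, best-scoring sentences first.
    rankings = {c: sorted(range(len(sentences)),
                          key=lambda i: score(sentences[i], c),
                          reverse=True)
                for c in tags}
    cursors = {c: 0 for c in tags}                    # next candidate per tag
    selected, seen = [], set()
    while len(selected) < how_many and len(seen) < len(sentences):
        for c in tags:                                # tags take turns
            while cursors[c] < len(sentences) and rankings[c][cursors[c]] in seen:
                cursors[c] += 1                       # skip already-selected sentences
            if cursors[c] < len(sentences):
                i = rankings[c][cursors[c]]
                seen.add(i)
                selected.append(i)
                if len(selected) == how_many:
                    return selected
    return selected                                   # indices into `sentences`
```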
3. EXPERIMENTS

3.1 Experimental setting

The dataset we have used for evaluating our strategies is the CoNLL 2003 named entity extraction dataset. The dataset consists of 1,393 Reuters newswire articles, for a total of 301,418 tokens. The tagset consists of 4 tags (LOC, PER, ORG, MISC, standing for “location”, “person”, “organization”, and “miscellaneous”, respectively) plus the special tag O, which tags any token / separator not tagged by any tag in {LOC, PER, ORG, MISC}. The tokens inside the corpus are tagged as follows: 10,645 tokens are tagged as LOC, 9,323 as ORG, 10,059 as PER, 5,062 as MISC, while the remaining 266,329 are tagged as O. We used a version of the CoNLL corpus already preprocessed with Pianta and Zanoli's Tagpro system [10], a PoS-tagging system based on YamCha that computes features such as prefixes, suffixes, orthographic information (e.g., capitalization, hyphenation) and morphological features, as well as PoS tags and chunk tags. These features altogether form the vectorial representations of tokens and separators that are fed to the learning device.

For the latter, we have tested two alternative, off-the-shelf packages, i.e., YamCha (http://www.chasen.org/~taku/software/YamCha/) and CRF++ (http://crfpp.sourceforge.net/), respectively based on support vector machines and conditional random fields.

We evaluate the results of our experiments using the F1 measure on a token & separator evaluation model [3]. The token & separator model considers each token and each separator as an object of tagging; for instance, given tag c, the TP (“true positives”) entry of the contingency table for c consists in the number of tokens that are correctly assigned tag c plus the number of separators that are correctly assigned tag c. Once the contingency tables for all the tags in C have been filled, the evaluation is done using standard micro-averaged and macro-averaged F1 (denoted F1µ and F1M, respectively).
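As an illustration of the token & separator model, the sketch below (ours; the exact bookkeeping of [3] may differ in detail) fills one contingency table per tag in C and computes micro- and macro-averaged F1 from them.

```python
# Sketch of token & separator evaluation: every token and every separator is an
# object of tagging; one contingency table per tag in C (the special tag "O" is
# not itself evaluated), then micro- and macro-averaged F1 over those tables.

from collections import defaultdict

def f1(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def evaluate(true_tags, predicted_tags, tagset):
    """true_tags / predicted_tags: one tag per token-or-separator, in text order;
    tagset: the tags in C, e.g. {"LOC", "PER", "ORG", "MISC"}."""
    tables = {c: defaultdict(int) for c in tagset}
    for true, pred in zip(true_tags, predicted_tags):
        for c in tagset:
            if pred == c and true == c:
                tables[c]["TP"] += 1
            elif pred == c:
                tables[c]["FP"] += 1
            elif true == c:
                tables[c]["FN"] += 1
    micro = f1(sum(t["TP"] for t in tables.values()),
               sum(t["FP"] for t in tables.values()),
               sum(t["FN"] for t in tables.values()))
    macro = sum(f1(t["TP"], t["FP"], t["FN"]) for t in tables.values()) / len(tagset)
    return micro, macro
```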
3.2 Experimental protocol

In this work we adopt the following iterative experimental protocol. The protocol has three integer parameters α, β, and γ. Let Ω be a set of natural language sentences partitioned into a training set Tr and a test set Te, and let σ be an active learning strategy:

1. Set an iteration counter t = 0;

2. Set the current training set Tr_t to the set of the first α sentences of Tr; set the current “unlabeled set” U_t ← Tr / Tr_t;

3. For t = 1, . . . , β repeat the following steps:

   (a) Generate a classifier Φ̂_t from the current training set Tr_t;
   (b) Evaluate the effectiveness of Φ̂_t on Te;
   (c) Classify U_t by means of Φ̂_t;
   (d) Rank U_t according to strategy σ, thus generating the ranking σ(U_t);
   (e) Let r(U_t, γ) be the smallest prefix of σ(U_t) (i.e., the smallest number of top-ranked elements of σ(U_t)) that contains at least γ tokens; set Tr_{t+1} ← Tr_t ∪ r(U_t, γ); set U_{t+1} ← U_t / r(U_t, γ).

It is important to remark that Step 3b has only the purpose of collecting the results for experimental purposes (i.e., for producing the tables of Section 3.3); since it uses the test set Te, its results should obviously not be (and are not) accessible to the algorithm.

The above protocol simulates the activity of a human annotator who, at the beginning of the process, has available a training set Tr_0 consisting of α manually tagged sentences, and an “unlabeled set” U_0 consisting of |Tr| − α untagged sentences. The annotator generates a classifier Φ̂_0 from Tr_0, uses it to tag the sentences in U_0, asks the active learning agent to rank them, manually labels the top-ranked ones for a total of roughly γ tokens, generates a new classifier Φ̂_1 from an augmented training set that comprises Tr_0 and the newly tagged sentences, and repeats this process β times.

In our experiments we have set α = 110 (in the CoNLL 2003 dataset this corresponds to approximately 2,000 tokens), β = 20, and γ = 200; this means that each strategy is evaluated by testing the accuracy of the classifiers generated from training sets consisting of approximately 2000, 2200, . . . , 5800, 6000 training tokens, for a total of 20 experiments per strategy. We think these parameters are realistic, since they simulate a situation in which:

• there are only about 100 manually tagged sentences at the beginning (this is reasonable since, in applications in which significantly more training data are already available, human annotators might not find it worthwhile to annotate any further);

• every time the human annotator manually labels 200 unlabeled tokens, he/she wants to retrain the system (this is reasonable, since he/she wants to operate on a ranking of the unlabeled sentences that incorporates as much as possible the feedback he/she has already given to the system);

• the human annotator does not want to do any further manual labeling once about 6,000 training tokens are available (this seems reasonable, since at this point the cost-effectiveness of the manual effort has probably decreased significantly).

As the baseline strategy for the evaluation of our results we adopt the one that consists in adding further labeled sentences to the training set by picking them at random. This simulates the behaviour of a human annotator who picks unlabeled sentences and labels them in no particular order.
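The protocol can be summarized in a few lines of code. In the sketch below (ours), train, evaluate and strategy are placeholder callables: in the actual experiments they would wrap YamCha or CRF++ and one of the strategies of Section 2.2, so they are assumed interfaces rather than real APIs of those packages; the strategy is assumed to return a score in which lower means more informative.

```python
# Compact sketch of the iterative protocol: alpha seed sentences, beta rounds of
# active learning, roughly gamma newly annotated tokens per round. Sentences are
# lists of tokens; train/evaluate/strategy are placeholders, not real package APIs.

def active_learning_run(train_sentences, test_sentences, strategy, train, evaluate,
                        alpha=110, beta=20, gamma=200):
    labeled = list(train_sentences[:alpha])       # Tr_0: the first alpha sentences
    unlabeled = list(train_sentences[alpha:])     # U_0
    results = []
    for t in range(beta + 1):
        classifier = train(labeled)                           # step (a)
        results.append(evaluate(classifier, test_sentences))  # step (b), test set only
        if t == beta or not unlabeled:
            break
        ranking = sorted(unlabeled,                           # steps (c) + (d)
                         key=lambda s: strategy(classifier, s))
        batch, n_tokens = [], 0
        while ranking and n_tokens < gamma:                   # step (e): smallest prefix
            s = ranking.pop(0)                                # with at least gamma tokens
            batch.append(s)
            n_tokens += len(s)
        labeled += batch                                      # Tr_{t+1}
        unlabeled = ranking                                   # U_{t+1}
    return results
```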
3.3 Results

The main results of our experiments are summarized in Table 1. This table reports, for each individual strategy, the values of F1µ and F1M obtained after 20 training sessions resulting from the protocol of Section 3.2, with α = 110, β = 20, and γ = 200, using the two different learners, SVMs and CRFs.

Table 1: Values of F1 obtained after the last training session, i.e., with classifiers trained on approximately 2,000 training tokens plus approximately 4,000 tokens manually annotated as a result of the active learning strategy. The best performance in each row is marked with an asterisk.

              base   MAC    MAS    MMC    MMS    RRAS   RRMS   MTR    RRMTR  MTC    MedTR
F1µ  YamCha   .650   .683*  .583   .645   .530   .639   .596   .628   .607   .632   .555
F1µ  CRF++    .656   .697*  .610   .654   .463   .643   .573   .622   .509   .626   .544
F1M  YamCha   .634   .661*  .525   .633   .526   .644   .577   .593   .597   .613   .538
F1M  CRF++    .639   .664*  .546   .634   .473   .633   .564   .563   .519   .606   .517

Quite surprisingly, the only genuine strategy that outperforms the random baseline is the MAC strategy. The relative improvement of MAC over the random baseline (“base” in Table 1) ranges from 3.9% up to 6.3%. This improvement matches our expectations, given the close relation between the MAC strategy and the uncertainty sampling method [9], which had already proved to be effective for AL.

All the other strategies, instead, perform worse than, or no better than, the random baseline. In order to understand the possible motivations behind these results we have inspected the sentences selected by the various strategies at the various iterations. This inspection allowed us to draw some specific conclusions on some of the strategies, and a general observation on the entire pool of strategies.

The MTR and RRMTR strategies tend to select very short sentences (two or three words) composed just of named entities. This allows gathering a lot of different instances of named entities, but without a context of use, which is important in order to learn how to perform extraction from longer, more articulated sentences.

The MTC strategy selects sentences of variable length, but tends to over-select sentences full of named entities, thus with a very limited number of O-tagged tokens.

A common aspect of all the strategies is that the more similar two sentences are, the more similar are the scores that the various strategies assign to them. If the dataset contains many similar sentences, and such sentences obtain high scores, the contribution of relevant information to the training set is limited, because of the redundancy contained in the set of sentences selected.

A comparison of the round robin-based strategies (RRAS, RRMS, RRMTR) against their respective “single-rank” versions (MAS, MMS, MTR) shows that the RR strategies produce an improvement in the F1M measure, as should be expected when using a class-balancing method such as RR.

The comparison of the averaging-based strategies (MAC, MAS, RRAS) against their respective versions based on maximization / minimization (MMC, MMS, RRMS) shows that averaging always performs better than maximization / minimization. This indicates that the smoothing introduced by averaging helps the strategies to filter out the single “false-relevant” tokens that may appear in otherwise non-relevant sentences.

4. CONCLUSIONS

We have argued that, in active learning for information extraction, the sentence should be the unit of ranking. We have thus studied several strategies for scoring a given sentence for ranking purposes, based on the classification score and the confidence score obtained by each token in the sentence. On the positive side, the experimental results that we have obtained by testing these strategies on a named entity extraction task show one such strategy (Min Average Confidence) to outperform the others, irrespective of the learning device used (support vector machines or conditional random fields) and of the evaluation measure used (micro-averaged or macro-averaged F1). On the negative side, the same results show that all the other strategies, though seemingly based on solid intuitions, tend to be roughly equivalent to a random strategy. In the future we plan to test these strategies further, possibly on IE tasks more difficult than named entity extraction, such as opinion extraction.

Acknowledgments

We thank Emanuele Pianta and Roberto Zanoli for kindly providing us with a version of the CoNLL corpus already preprocessed with their Tagpro system [10].

5. REFERENCES

[1] Christopher J. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[2] Yejin Choi, Eric Breck, and Claire Cardie. Joint extraction of entities and relations for opinion recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP'06), Sydney, AU, 2006.

[3] Andrea Esuli, Michal Pryczek, and Fabrizio Sebastiani. Evaluating information extraction systems. Technical report, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, Pisa, IT, 2010. Forthcoming.

[4] Andrea Esuli and Fabrizio Sebastiani. Active learning strategies for multi-label text classification. In Proceedings of the 31st European Conference on Information Retrieval (ECIR'09), pages 102–113, Toulouse, FR, 2009.

[5] Andrea Esuli and Fabrizio Sebastiani. Enhancing opinion extraction by automatically annotated lexical resources. In Proceedings of the 4th Language Technology Conference (LTC'09), pages 224–228, Poznań, PL, 2009.

[6] Rosie Jones, Rayid Ghani, Tom Mitchell, and Ellen Riloff. Active learning for information extraction with multiple view feature sets. In Proceedings of the Workshop on Adaptive Text Extraction and Mining (ATEM'03), pages 18–25, Cavtat–Dubrovnik, HR, 2003.

[7] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML'01), pages 282–289, Williamstown, US, 2001.

[8] Florian Laws and Hinrich Schütze. Stopping criteria for active learning of named entity recognition. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING'08), pages 465–472, Manchester, UK, 2008.

[9] David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94), pages 3–12, Dublin, IE, 1994.

[10] Emanuele Pianta and Roberto Zanoli. Tagpro: A system for Italian POS tagging based on SVMs. Intelligenza Artificiale, 4(2):8–9, 2007.

[11] Sunita Sarawagi. Information extraction. Foundations and Trends in Databases, 1(3):261–377, 2008.

[12] Dan Shen, Jie Zhang, Jian Su, Guodong Zhou, and Chew-Lim Tan. Multi-criteria-based active learning for named entity recognition. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), pages 589–596, Barcelona, ES, 2004.

[13] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2:45–66, 2001.