MSM2013 IE Challenge: Annotowatch

Stefan Dlugolinsky, Peter Krammer, Marek Ciglan, and Michal Laclavik
Institute of Informatics, Slovak Academy of Sciences, Dubravska cesta 9, 845 07 Bratislava, Slovak Republic
{stefan.dlugolinsky,peter.krammer,marek.ciglan,michal.laclavik}@savba.sk
http://ikt.ui.sav.sk

Copyright © 2013 held by author(s)/owner(s). Published as part of the #MSM2013 Workshop Concept Extraction Challenge Proceedings, available online as CEUR Vol-1019 at http://ceur-ws.org/Vol-1019. Making Sense of Microposts Workshop @ WWW'13, May 13th 2013, Rio de Janeiro, Brazil.

Abstract. In this paper, we describe our approach taken in the MSM2013 IE Challenge, which was aimed at concept extraction from microposts. The goal of the approach was to combine several existing NER tools which use different classification methods and to benefit from their combination. Several NER tools were chosen and individually evaluated on the challenge training set. We observed that the tools differed in which entity types they recognized best. In addition, the tools produced diverse results, so their combined output achieved a higher recall than the best individual tool; as expected, the precision significantly decreased. The main challenge was therefore in combining the annotations extracted by the diverse tools. Our approach was to exploit machine-learning methods. We constructed feature vectors from the annotations yielded by the different extraction tools and from various text characteristics, and we used several supervised classifiers to train classification models. The results showed that several classification models achieved better results than the best individual extractor.

Keywords: Information extraction, machine learning, named entity recognition, microposts

1 Introduction

Most current Named Entity Recognition (NER) methods have been designed for concept extraction from relatively long and grammatically correct texts, such as newswire or biomedical texts. However, more and more user-generated content on the Web consists of relatively short and often grammatically incorrect texts, such as microposts, on which these methods perform worse. The goal of the approach proposed in this paper is to combine several different information extraction methods in order to reach a more precise concept extraction on relatively short texts. We hypothesized that if these methods were combined properly, they would perform better than the best individual method from the pool. This assumption was partially confirmed through the evaluation of several available and well-known NER tools that use different entity extraction methods. The merged results of these tools showed a higher recall than that of the best tool, but with a very low precision. The goal was to reduce or eliminate this tradeoff. The higher recall indicates that the different methods complement each other and that there is room for improvement. We tried various machine-learning algorithms and built several models capable of producing results based on the concepts extracted by the underlying tools. The goal was to produce a model with the highest possible precision while approximating the recall measured for the unified extracted concepts. In the following sections, we describe the NER tools that were used and how they individually performed on the MSM2013 IE Challenge (from here on referenced as "challenge") training set (version 1.5). We also briefly describe the methodology of our investigation (i.e., how our solution was built).
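To make the combination step concrete, the following minimal Java sketch shows one way the candidate annotations of several extractors could be pooled into the unified result set whose recall is discussed above. The Annotation record and the Extractor interface are illustrative assumptions, not the actual Annotowatch code.

// Minimal sketch (not the actual Annotowatch implementation): pooling the
// candidate annotations produced by several NER tools into one unified set.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

record Annotation(int start, int end, String type, String source) {}   // hypothetical

interface Extractor {                                                    // hypothetical
    String name();
    List<Annotation> extract(String micropost);
}

class AnnotationPool {
    /** Runs every extractor on the micropost and returns the union of their results. */
    static List<Annotation> unify(String micropost, List<Extractor> extractors) {
        List<Annotation> pool = new ArrayList<>();
        for (Extractor e : extractors) {
            pool.addAll(e.extract(micropost));
        }
        // Sort by offset so that overlapping candidates from different tools sit
        // next to each other; a later classification step decides which to keep.
        pool.sort(Comparator.comparingInt(Annotation::start)
                            .thenComparingInt(Annotation::end));
        return pool;
    }
}

Such a pooled candidate set is what yields the high recall and low precision reported in Section 3; the machine-learning step of Section 4 is what trims it back down.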
2 Tools Used

Our solution incorporates several available, well-known NER tools: Annie Named Entity Recognizer [1], Apache OpenNLP (http://opennlp.apache.org), Illinois Named Entity Tagger (with the 4-label type model) [2], Illinois Wikifier [3], LingPipe (with the English News MUC-6 model; http://alias-i.com/lingpipe), OpenCalais (http://www.opencalais.com/about), Stanford Named Entity Recognizer (with the 4-class caseless model) [4], and WikipediaMiner (http://wikipedia-miner.cms.waikato.ac.nz). This list is complemented by the Miscinator, a tool specifically designed for the challenge. The Miscinator detects MISC concepts (i.e., entertainment/award events, sports events, movies, TV shows, political events, and programming languages). One conclusion of the tools' evaluation was that they did not perform well in detecting entertainment, award, and sports events; we therefore built a specialized gazetteer annotation tool for this task. The gazetteer was constructed from the event annotations found in the challenge training set, extended by the Google Sets service (a method trained on web crawls), which generates a list of items from several examples.

The only customization made to the listed tools was the mapping of their annotation types to the target entity types (i.e., Location - LOC, Person - PER and Organization - ORG) and the filtering of unimportant ones (e.g., Token). Relevant OpenCalais entities were mapped to the target entity types in the same way. Illinois Wikifier was treated a bit differently, since it annotates text with Wikipedia concepts and its output did not include a type classification for the annotations. To overcome this drawback, we mapped the annotations to the DBpedia knowledge base and used the DBpedia types associated with the given concepts to derive the target entity types. WikipediaMiner annotations were mapped the same way.

3 Evaluation of Used Tools

All of the tools used were evaluated on the challenge training set. Precision, Recall, and F1 were computed in three ways. The first, strict method (P_S, R_S and F1_S) considered partially correct responses as incorrect, whereas the second, lenient method (P_L, R_L and F1_L) considered them as correct. The third method was an average of the previous two (P_A, R_A and F1_A). The evaluation results are shown in Fig. 1. We also evaluated the unified responses of all of the tools. The results showed that the recall was much higher (R_S = 90%) than that of the best individual tool (Illinois NER reached R_S = 60%), but the precision was very poor (P_S = 18%), and hence so was the F1 score (F1_S = 30%). The best performing tool on microposts was OpenCalais, which scored P_S = 70%, R_S = 58% and F1_S = 64%.

[Fig. 1. Micro summary of NER tools over training set v1.5: radar chart of P_S, R_S, F1_S, P_L, R_L, F1_L, P_A, R_A and F1_A for Annie, Apache OpenNLP, Illinois NER, Illinois Wikifier, LingPipe, OpenCalais, Stanford NER and WikipediaMiner.]

4 Machine Learning

Our goal was to create a model that would take the most relevant results detected by each tool and perform better than the best tool did individually. We used statistical classifiers to achieve this goal.

4.1 Input Features

We took the approach of describing how particular extractors performed on different entity types compared to the responses of the other extractors. Used as a training vector, this description was the input for training a classification model. A vector of input training features was generated for each annotation found by the integrated NER tools; we call this annotation the reference annotation. The vector of each reference annotation consisted of several sub-vectors. The first sub-vector was an annotation vector, which described the reference annotation: whether it was uppercase or lowercase, started with a capital letter or capitalized all of its words, its word count, and the type of the detected annotation (LOC, MISC, ORG, PER, NP noun phrase, VP verb phrase, OTHER). The second sub-vector described the micropost as a whole; it contained features describing whether all words longer than four characters were capitalized, uppercase, or lowercase. The remaining sub-vectors were computed from the overlap of the reference annotation with the annotations produced by the other NER tools. Such a sub-vector (which we term a method vector) was computed for each extractor and contained four further vectors (average scores per named entity type), one for each target entity type (LOC, MISC, ORG, PER). The average score vector consisted of five components: ail, the average intersection length of the reference annotation with the annotations produced by the other extractors (from here on referenced as the others' annotations); aiia, the average percentage intersection of the others' annotations with the reference annotation; aiir, the average percentage intersection of the reference annotation with the others' annotations; the average confidence (if the underlying extractor returns such a value); and the variance of that confidence. The last component of the training vector was the correct answer (i.e., the correct annotation type taken from the manual annotation).
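The overlap components can be illustrated with a short, hedged Java sketch. The exact averaging and normalization used in Annotowatch may differ; the Annotation record is the same hypothetical type as in the earlier sketch.

import java.util.List;

// Hypothetical Annotation record, as in the earlier sketch.
record Annotation(int start, int end, String type, String source) {}

class OverlapFeatures {
    /** Length of the character overlap between two annotations (0 if disjoint). */
    static int intersection(Annotation a, Annotation b) {
        return Math.max(0, Math.min(a.end(), b.end()) - Math.max(a.start(), b.start()));
    }

    /**
     * Computes [ail, aiia, aiir] for a reference annotation against the
     * annotations of one other extractor that carry the given entity type.
     */
    static double[] averageScores(Annotation ref, List<Annotation> others, String type) {
        double ail = 0, aiia = 0, aiir = 0;
        int n = 0;
        for (Annotation o : others) {
            if (!o.type().equals(type)) continue;
            int inter = intersection(ref, o);
            ail  += inter;                                       // absolute overlap length
            aiia += (double) inter / (o.end() - o.start());      // share of the other annotation covered
            aiir += (double) inter / (ref.end() - ref.start());  // share of the reference covered
            n++;
        }
        if (n == 0) return new double[] {0, 0, 0};
        return new double[] {ail / n, aiia / n, aiir / n};
    }
}

Concatenating such triples (plus the confidence statistics) per extractor and per target type, together with the annotation and micropost sub-vectors, gives feature vectors of the size reported in Section 4.2.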
4.2 Model Training

Several types of classification models were considered, especially tree models, which allow the use of both numerical and discrete attributes. Thanks to its large number of trees, Random Forests looked very promising and reliable during the first round of testing. However, the increasing number of input attributes caused the performance of Random Forests to degrade. Therefore, we used a single decision tree generated by the C4.5 algorithm [5] as a simpler alternative. The set of training vectors was preprocessed before model training: duplicate rows were removed and a randomize filter was applied to shuffle the training vectors. The preprocessed training set contained approximately 35,000 vectors, each consisting of 105 attributes. The trained model was represented by a classification tree built by the J48 algorithm in Weka (http://www.cs.waikato.ac.nz/ml/weka/), an open-source implementation of the C4.5 algorithm with pruning. Ten-fold cross-validation was used. The model classified each record into one of five discrete classes (NULL, ORG, LOC, MISC, PER).
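As an illustration of this training setup, the following sketch uses the Weka API (J48 with the -M option and ten-fold cross-validation, as described above). The ARFF file name and the random seed are assumptions; the actual pipeline builds the instances from the feature vectors of Section 4.1.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainAnnotowatchModel {
    public static void main(String[] args) throws Exception {
        // Load the preprocessed training vectors (~35,000 rows, 105 attributes).
        Instances data = new DataSource("training-vectors.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);   // last attribute = correct answer

        // J48 is Weka's C4.5 implementation; -M sets the minimum number of
        // instances per leaf (the parameter varied between the submitted runs).
        J48 tree = new J48();
        tree.setOptions(new String[] {"-M", "2"});

        // Ten-fold cross-validation, as used for the reported estimates.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Train the final model on the full training set.
        tree.buildClassifier(data);
    }
}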
4.3 Estimated Performance of the Model

To get an idea of our model's performance, we trained the model on an 80% split of the challenge training set, cleaned of duplicate records, and evaluated it on the remaining 20% split. The evaluation results are displayed in Table 1, together with the results of the best individually performing tool for each entity type.

Table 1. Evaluation on the 20% training set split

       Illinois NER       Illinois Wikifier   Stanford NER       Miscinator         Annotowatch
       P_S  R_S  F1_S     P_S  R_S  F1_S      P_S  R_S  F1_S     P_S  R_S  F1_S     P_S  R_S  F1_S
LOC    54%  56%  55%      36%  44%  40%       55%  54%  55%      -    -    -        57%  56%  57%
MISC    4%   7%   5%      10%  18%  13%        2%   2%   2%      87%  44%  59%      55%  58%  57%
ORG    31%  35%  33%      60%  41%  49%       23%  28%  25%      -    -    -        64%  49%  56%
PER    86%  84%  85%      89%  56%  69%       83%  78%  81%      -    -    -        85%  87%  86%
ALL    62%  66%  64%      63%  49%  55%       60%  60%  60%      87%   4%   7%      77%  75%  76%

5 Runs Submitted

Three runs were submitted for evaluation in the challenge. The first run was generated by the model trained with the C4.5 algorithm with the parameter M (the minimum number of instances per leaf) set to 2. The second run was generated by the model trained with M set to 3. The third run was based on the first run and involved specific post-processing: if a micropost identical to one in the training set was annotated, we extended the detected concepts by those from the manually annotated training data (affecting three microposts), and a gazetteer built from a list of organizations found in the training set was used to extend the ORG annotations of the model (affecting 69 microposts). The models producing the submission results were trained on the full challenge training set.

Acknowledgments. This work is supported by projects VEGA 2/0185/13, VENIS FP7-284984, CLAN APVV-0809-11 and ITMS: 26240220072.

References

1. Cunningham, H., Maynard, D., Bontcheva, K., Tablan, V.: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). (2002)
2. Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning. CoNLL '09, Stroudsburg, PA, USA, Association for Computational Linguistics (2009) 147-155
3. Ratinov, L., Roth, D., Downey, D., Anderson, M.: Local and global algorithms for disambiguation to Wikipedia. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. HLT '11, Stroudsburg, PA, USA, Association for Computational Linguistics (2011) 1375-1384
4. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. ACL '05, Stroudsburg, PA, USA, Association for Computational Linguistics (2005) 363-370
5. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)