<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MSM2013 IE Challenge: Annotowatch</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefan Dlugolinsky</string-name>
          <email>stefan.dlugolinsky@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Krammer</string-name>
          <email>peter.krammer@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marek Ciglan</string-name>
          <email>marek.ciglan@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Laclavik</string-name>
          <email>michal.laclavik@savba.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Informatics, Slovak Academy of Sciences</institution>
          ,
          <addr-line>Dubravska cesta 9, 845 07 Bratislava</addr-line>
          ,
          <country>Slovak Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>21</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>In this paper, we describe the approach we took in the MSM2013 IE Challenge, which was aimed at concept extraction from microposts. The goal of the approach was to combine several existing NER tools that use different classification methods and to benefit from their combination. Several NER tools were chosen and individually evaluated on the challenge training set. We observed that the tools differed in the entity types on which they performed best. In addition, the tools produced diverse results, which, when combined, yielded a higher recall than that of the best individual tool; as expected, the precision decreased significantly. The main challenge was in combining the annotations extracted by these diverse tools. Our approach was to exploit machine-learning methods: we constructed feature vectors from the annotations yielded by the different extraction tools and from various text characteristics, and we used several supervised classifiers to train classification models. The results showed that several of these models achieved better results than the best individual extractor.</p>
      </abstract>
      <kwd-group>
        <kwd>Information extraction</kwd>
        <kwd>machine-learning</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>microposts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Most of the current Named Entity Recognition (NER) methods have been
designed for concept extraction from relatively long and grammatically correct
texts, such as newswire or biomedical texts. However, more and more
user-generated content on the Web consists of relatively short and often
grammatically incorrect texts, such as microposts, on which these methods
perform worse. The goal of the approach proposed in this paper is to combine
several different information extraction methods in order to reach a more
precise concept extraction on relatively short texts. We hypothesized that if
these methods were combined properly, they would perform better than the best
individual method from the pool. This assumption was partially proven through
the evaluation of several available and well-known NER tools that use different
entity extraction methods. The merged results of these tools showed a higher
recall than that of the best individual tool, but with a very low precision. Our
goal was to reduce or eliminate this tradeoff. The higher recall indicates that
the different methods complement each other and that there is room for
improvement. We tried various machine-learning algorithms and built several
models capable of producing results based on the concepts extracted by the
underlying tools. The goal was to produce a model with the highest possible
precision while approximating the recall measured for the unified extracted
concepts. In the following sections, we describe the NER tools that were used
and how they individually performed on the MSM2013 IE Challenge (from here on
referred to as the “challenge”) training set (version 1.5). We also briefly
describe the methodology of our investigation (i.e., how our solution was
built).</p>
    </sec>
    <sec id="sec-2">
      <title>Tools Used</title>
      <p>
        Our solution incorporates several available, well-known NER tools: Annie Named
Entity Recognizer [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Apache OpenNLP (http://opennlp.apache.org), Illinois Named Entity Tagger
(with the 4-label type model) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Illinois Wikifier [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], LingPipe (with the English News MUC-6 model; http://alias-i.com/lingpipe),
Open Calais (http://www.opencalais.com/about), Stanford Named Entity
Recognizer (with the 4-class caseless model) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and WikipediaMiner (http://wikipedia-miner.cms.waikato.ac.nz). This list is
complemented by the Miscinator, a tool designed specifically for the challenge.
The Miscinator detects MISC concepts (i.e., entertainment/award events, sports
events, movies, TV shows, political events, and programming languages). One
conclusion of the tools' evaluation was that they did not perform well in
detecting entertainment, award, and sports events; therefore, we built a
specialized gazetteer annotation tool for this task. The gazetteer was
constructed from the event annotations found in the challenge training set,
extended by the Google Sets service (a method trained on web crawls) which
generates a list of items based on several examples. The only customization
made to the listed tools was the mapping of their annotation types to the
target entity types (i.e., Location - LOC, Person - PER and Organization -
ORG) and the filtering of unimportant ones (e.g., Token). Relevant OpenCalais
entities were similarly mapped to the target entities. Illinois Wikifier was
treated a bit differently, as it annotated texts with Wikipedia concepts and
its output did not comprise a type classification for the annotations. To
overcome this drawback, we mapped the annotations to the DBpedia knowledge
base and used the DBpedia types associated with the given concepts to derive
the target entity types. WikipediaMiner annotations were mapped the same way.
      </p>
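      <p>For illustration, such a type mapping can be a simple lookup table with a
DBpedia-based fallback. The following is a minimal sketch; the source labels and
DBpedia ontology types shown are assumptions, not the tools' exact label sets:</p>
      <preformat>
# Hypothetical sketch of the annotation-type mapping described above.
TYPE_MAP = {
    "Location": "LOC", "City": "LOC", "Country": "LOC",  # illustrative labels
    "Person": "PER",
    "Organization": "ORG", "Company": "ORG",
}

# Assumed DBpedia ontology types, used as a fallback for Illinois Wikifier
# and WikipediaMiner annotations, which carry no type classification.
DBPEDIA_MAP = {
    "dbo:Place": "LOC",
    "dbo:Person": "PER",
    "dbo:Organisation": "ORG",
}

def map_annotation(source_type, dbpedia_types=()):
    """Map a tool-specific annotation type to LOC/PER/ORG, or None to drop it."""
    if source_type in TYPE_MAP:
        return TYPE_MAP[source_type]
    for t in dbpedia_types:          # fall back to DBpedia types (Wikifier case)
        if t in DBPEDIA_MAP:
            return DBPEDIA_MAP[t]
    return None                      # unimportant annotations (e.g. Token) are filtered
      </preformat>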
    </sec>
    <sec id="sec-3">
      <title>Evaluation of Used Tools</title>
      <p>All of the tools were evaluated on the challenge training set. Three ways of
computing the Precision, Recall, and F1 metrics were used. The first method was
strict (PS, RS and F1S), which considered partially correct responses as
incorrect; the second, lenient, considered them as correct (PL, RL and F1L);
and the third was an average of the previous two (PA, RA and F1A). The
evaluation results are shown in Fig. 1. We also evaluated the unified responses
of all of the tools. The results showed that the recall was much higher
(RS = 90%) than that of the best individual tool (Illinois NER reached
RS = 60%), but the precision was very poor (PS = 18%) and hence so was the F1
score (F1S = 30%). The best performing tool on microposts was OpenCalais,
which scored PS = 70%, RS = 58% and F1S = 64%.</p>
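      <p>A minimal sketch of the strict and lenient scoring follows, assuming
annotations are (start, end, type) spans over non-empty response and gold sets;
the challenge's official scorer may differ in detail:</p>
      <preformat>
# A minimal sketch of strict vs. lenient scoring; spans are (start, end, type).
def evaluate(responses, gold):
    exact = sum(1 for r in responses if r in gold)
    # partially correct: same type and overlapping span, but not an exact match
    partial = sum(
        1 for (s, e, t) in responses
        if (s, e, t) not in gold
        and any(t == gt and e > gs and ge > s for (gs, ge, gt) in gold)
    )
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    p_s, r_s = exact / len(responses), exact / len(gold)     # strict
    p_l = (exact + partial) / len(responses)                 # lenient
    r_l = (exact + partial) / len(gold)
    strict = (p_s, r_s, f1(p_s, r_s))
    lenient = (p_l, r_l, f1(p_l, r_l))
    average = tuple((a + b) / 2 for a, b in zip(strict, lenient))
    return strict, lenient, average
      </preformat>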
      <p>[Fig. 1. Evaluation of the used tools on the challenge training set:
strict, lenient, and average precision, recall, and F1 (PS, RS, F1S; PL, RL,
F1L; PA, RA, F1A) for Annie, Apache OpenNLP, Illinois NER, Illinois Wikifier,
LingPipe, Open Calais, Stanford NER, and WikipediaMiner.]</p>
    </sec>
    <sec id="sec-4">
      <title>Combining the Extraction Results</title>
      <p>Our goal was to create a model that would take the most relevant results
detected by each tool and perform better than the best tool did individually.
We used statistical classifiers to achieve this goal.</p>
      <p>Each candidate annotation had to be described as a training vector; this
description was an input for training a classification model. A vector of input
training features was generated for each annotation found by the integrated
NER tools; we called this annotation a reference annotation. The vector of each
reference annotation consisted of several sub-vectors. The first sub-vector was
an annotation vector, which described the reference annotation itself: whether
it was uppercase or lowercase, whether it had a capital first letter or
capitalized all of its words, its word count, and the type of the detected
annotation (LOC, MISC, ORG, PER, NP noun phrase, VP verb phrase, OTHER).
The second sub-vector described the micropost as a whole; it contained features
describing whether all words longer than four characters were capitalized,
uppercase, or lowercase. The remaining sub-vectors were computed from the
overlap of the reference annotation with the annotations produced by the other
NER tools. Such a sub-vector (which we termed a method vector) was computed
for each extractor and contained four average-score vectors, one per target
entity type (LOC, MISC, ORG, PER). Each average-score vector consisted of five
components: ail, the average intersection length of the reference annotation
with the annotations produced by the other extractors (from here on, the
others' annotations); aiia, the average percentage intersection of the others'
annotations with the reference annotation; aiir, the average percentage
intersection of the reference annotation with the others' annotations; the
average confidence (if the underlying extractors return such a value); and the
variance of that confidence. The last component of the training vector was the
correct answer, i.e., the correct annotation type taken from the manual
annotation.</p>
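      <p>For illustration, a minimal sketch of how the three overlap components of
an average-score vector could be computed follows; the function and the span
representation are our assumptions, not the exact implementation:</p>
      <preformat>
# Hypothetical sketch of the overlap components (ail, aiia, aiir) for one
# reference annotation against one extractor's annotations of one entity type;
# spans are (start, end) character offsets.
def overlap_features(ref, others):
    intersections = []
    for (s, e) in others:
        length = min(ref[1], e) - max(ref[0], s)
        if length > 0:                       # spans actually overlap
            intersections.append((length, e - s))
    if not intersections:
        return 0.0, 0.0, 0.0
    n = len(intersections)
    ref_len = ref[1] - ref[0]
    ail = sum(l for l, _ in intersections) / n             # avg intersection length
    aiia = sum(l / o for l, o in intersections) / n        # avg share of the other's annotation
    aiir = sum(l / ref_len for l, _ in intersections) / n  # avg share of the reference annotation
    return ail, aiia, aiir
      </preformat>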
        <sec id="sec-3-1-1">
          <title>Model Training</title>
        <p>
          Several types of classification models were considered, especially tree
models, which allow the use of both numerical and discrete attributes. With its
large number of trees, Random Forest looked very promising and reliable during
the first round of testing; however, the increasing number of input attributes
caused its performance to degrade. Therefore, we used a single decision tree
generated by the C4.5 algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] as a simpler alternative. The set of training vectors was preprocessed
before the model training: duplicate rows were removed from the training set
and a randomize filter was applied to shuffle the training vectors. The
preprocessed training set contained approximately 35,000 vectors, each
consisting of 105 attributes. The trained model was represented by a
classification tree built by the J48 algorithm in Weka
(http://www.cs.waikato.ac.nz/ml/weka/). J48 is an open-source implementation
of the C4.5 algorithm with pruning. Ten-fold cross-validation was used. The
model classified each record into one of five discrete classes (NULL, ORG,
LOC, MISC, PER).
        </p>
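        <p>As an illustration only, the following sketch reproduces the procedure
with scikit-learn's DecisionTreeClassifier, a CART tree standing in for Weka's
J48/C4.5; the file names, shapes, and the parameter mapping are assumptions:</p>
        <preformat>
# Analogous training setup with a CART tree in place of Weka's J48/C4.5.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X = np.load("features.npy")       # hypothetical ~35,000 x 105 feature matrix
labels = np.load("labels.npy")    # "NULL", "ORG", "LOC", "MISC" or "PER" per row
classes, y = np.unique(labels, return_inverse=True)

# Remove duplicate training vectors and shuffle, mirroring the preprocessing above.
_, keep = np.unique(np.column_stack([X, y]), axis=0, return_index=True)
idx = np.random.default_rng(0).permutation(keep)
X, y = X[idx], y[idx]

# min_samples_leaf is roughly analogous to J48's M parameter (set to 2 here).
model = DecisionTreeClassifier(min_samples_leaf=2)
print(cross_val_score(model, X, y, cv=10).mean())   # ten-fold cross-validation
        </preformat>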
        </sec>
        <sec id="sec-3-2-1">
          <title>Estimated Performance of the Model</title>
        <p>To get an idea of our model's performance, we trained the model on an
80% split of the challenge training set cleaned of duplicate records and
evaluated it on the remaining 20% split. The evaluation results are displayed
in Table 1, where we also include the results of the best individually
performing tool for each entity type.</p>
        <p>Three runs were submitted for evaluation in the challenge. The first run
was generated by the model trained by the C4.5 algorithm with the parameter M,
denoting the minimum number of instances per leaf, set to 2. The second run was
generated by the model trained with M set to 3. The third run was based on the
first run and involved specific post-processing: if a micropost identical to
one in the training set was annotated, we extended the detected concepts with
those from the manually annotated training data (affecting three microposts),
and a gazetteer built from a list of organizations found in the training set
was used to extend the ORG annotations of the model (affecting 69 microposts).
The models producing the submission results were trained on the full challenge
training set.</p>
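        <p>A minimal sketch of the run-3 gazetteer post-processing follows, under
the assumption that annotations are (start, end, type) spans; the function name
and matching details are illustrative:</p>
        <preformat>
# Hypothetical sketch: extend the model's output with ORG gazetteer matches.
import re

def extend_org(micropost, model_annotations, org_gazetteer):
    """Add an ORG span for every gazetteer entry found in the micropost."""
    annotations = list(model_annotations)
    for name in org_gazetteer:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", micropost):
            span = (m.start(), m.end(), "ORG")
            if span not in annotations:
                annotations.append(span)
    return annotations
        </preformat>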
        <p>Acknowledgments. This work is supported by projects VEGA 2/0185/13,
VENIS FP7-284984, CLAN APVV-0809-11 and ITMS: 26240220072.</p>
        </sec>
      </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cunningham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications</article-title>
          .
          <source>In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02)</source>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Design challenges and misconceptions in named entity recognition</article-title>
          .
          <source>In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning</source>
          .
          <source>CoNLL '09</source>
          , Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2009</year>
          )
          <fpage>147</fpage>
          -
          <lpage>155</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ratinov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Local and global algorithms for disambiguation to Wikipedia</article-title>
          . In:
          <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>
          . HLT '11, Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2011</year>
          )
          <fpage>1375</fpage>
          -
          <lpage>1384</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grenager</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In: Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. ACL '05</source>
          , Stroudsburg, PA, USA, Association for Computational Linguistics (
          <year>2005</year>
          )
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <source>C4.5: Programs for Machine Learning</source>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>