<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Concept Extraction Challenge: University of Twente at #MSM2013</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mena B. Habib</string-name>
          <email>m.b.habib@ewi.utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurice van Keulen</string-name>
          <email>m.vankeulen@ewi.utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of EEMCS, University of Twente</institution>
          ,
          <addr-line>Enschede</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>17</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>Twitter messages are a potentially rich source of continuously and instantly updated information. The shortness and informality of such messages are challenging for Natural Language Processing tasks. In this paper we present a hybrid approach to Named Entity Extraction (NEE) and Classification (NEC) for tweets. The system combines Conditional Random Fields (CRF) and Support Vector Machines (SVM) to achieve better results. For named entity type classification we use the AIDA [8] disambiguation system to disambiguate the extracted named entities and hence find their type.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        <bold>2.1 Conditional Random Fields.</bold> CRF is a probabilistic model that is widely used for NER [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Despite the successes of CRF, standard CRF training can be very expensive [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] due to its global normalization. For this task, we used an alternative method called empirical training [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to train a CRF model. The maximum likelihood estimate (MLE) of empirical training has a closed-form solution, so it needs neither iterative optimization nor global normalization; empirical training can therefore be radically faster than standard training. Furthermore, the MLE of empirical training is also an MLE of standard training, so it can reach precision competitive with standard training. Tweet text is tokenized using a tokenizer designed specifically for tweets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For each token, the following features are extracted and used to train the CRF: (a) the Part-of-Speech (POS) tag of the word, provided by a POS tagger designed for tweets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; (b) whether the initial character of the word is capitalized; (c) whether all characters of the word are capitalized.
      </p>
      <p>
        <bold>Support Vector Machines.</bold> SVM is a machine learning approach used for classification and regression problems. For our task, we used an SVM to classify whether a tweet segment is a named entity or not. The training process takes the following steps:
      </p>
      <p>
        1. Tweet text is segmented using the segmentation approach described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Each segment is considered a candidate named entity. We enrich the segments by looking up entity mentions in a Knowledge Base (KB), here YAGO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], as described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The purpose of this step is to achieve high recall. To improve precision, we apply filtering hypotheses, such as removing segments that are composed of stop words or that carry a verb POS tag.
      </p>
      <p>
        2. For each tweet segment, we extract the following set of features in addition to those mentioned in section 2.1: (a) the joint and conditional probabilities of the segment, obtained from the Microsoft Web N-Gram service [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; (b) the stickiness of the segment as described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; (c) the segment frequency over a collection of around 5 million tweets (from http://wis.ewi.tudelft.nl/umap2011/ and the TREC 2011 Microblog track); (d) whether the segment appears in WordNet; (e) whether the segment appears as a mention in the YAGO KB; (f) the AIDA disambiguation system score for the disambiguated entity of that segment (if any).
      </p>
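      <p>The token-level features (a)-(c) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the POS tag, which the paper obtains from the tweet POS tagger of [1], is assumed to be supplied by the caller.</p>
      <preformat><![CDATA[
```python
# Sketch of the token-level CRF features; the POS tag is assumed to come
# from a tweet POS tagger and is passed in by the caller.
def token_features(token: str, pos_tag: str) -> dict:
    return {
        "pos": pos_tag,                   # (a) POS tag of the word
        "init_cap": token[:1].isupper(),  # (b) initial character capitalized?
        "all_caps": token.isupper(),      # (c) all characters capitalized?
    }

print(token_features("Twente", "NNP"))
# {'pos': 'NNP', 'init_cap': True, 'all_caps': False}
```
]]></preformat>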
      <p>
        The selection of the SVM features is based on the claim that disambiguation clues can help in deciding whether a segment is a mention of an entity or not [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        3. An SVM with an RBF kernel is trained to classify whether a candidate segment represents a mention of a named entity or not.
      </p>
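      <p>The RBF kernel underlying this classifier can be illustrated as below. This is only a sketch: the feature values (n-gram probability, stickiness, frequency, WordNet/YAGO flags, AIDA score) are hypothetical numbers, not taken from the paper.</p>
      <preformat><![CDATA[
```python
import math

# Illustrative RBF kernel k(x, z) = exp(-gamma * ||x - z||^2) as used by an
# RBF-kernel SVM; gamma and the feature vectors are made-up examples.
def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

entity_seg = [0.8, 0.6, 0.9, 1.0, 1.0, 0.9]  # hypothetical NE segment features
noise_seg = [0.1, 0.2, 0.1, 0.0, 0.0, 0.0]   # hypothetical non-entity segment
print(rbf_kernel(entity_seg, entity_seg))  # 1.0 (identical vectors)
```
]]></preformat>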
      <p>We take the union of the CRF and SVM results, after removing duplicate extractions, to get the final set of annotations. For overlapping extractions, we select the entity that appears in YAGO, and otherwise the one with the longer length.</p>
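      <p>The merge step can be sketched as follows: union with duplicates removed, overlaps resolved by preferring a span found in YAGO and then the longer one. The YAGO lookup is stubbed with a toy set; this is not the authors' code.</p>
      <preformat><![CDATA[
```python
# Spans are (start, end, text) character offsets within one tweet.
YAGO_MENTIONS = {"manchester united"}  # hypothetical KB contents

def merge(crf_spans, svm_spans):
    kept = []
    for span in sorted(set(crf_spans) | set(svm_spans)):  # union, deduplicated
        start, end, text = span
        clash = next((k for k in kept if k[0] < end and start < k[1]), None)
        if clash is None:
            kept.append(span)
            continue
        # prefer YAGO membership, then the longer extraction
        def score(s):
            return (s[2].lower() in YAGO_MENTIONS, s[1] - s[0])
        if score(span) > score(clash):
            kept[kept.index(clash)] = span
    return kept

crf = [(0, 10, "Manchester")]
svm = [(0, 17, "Manchester United"), (0, 10, "Manchester")]
print(merge(crf, svm))  # [(0, 17, 'Manchester United')]
```
]]></preformat>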
      <p>
        <bold>2.2 Named Entity Classification.</bold> The purpose of NEC is to assign the extracted mention to its correct entity type. For this task, we first use the prior type probability of the given mention in the training collection. If the extracted mention is out of vocabulary (i.e., it does not appear in the training set), we apply the AIDA disambiguation system to the extracted mention. AIDA provides the most probable entity for the mention. We get the Wikipedia categories of that entity from the KB to form an entity profile. Similarly, we use the training data to build a profile of Wikipedia categories for each of the entity types (PER, ORG, LOC and MISC).
      </p>
      <p>To find the type of the extracted mention, we measure the document similarity between the entity profile and the profiles of the four entity types, and assign the mention the type of the most similar profile.</p>
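      <p>The profile comparison can be sketched with bags of categories and cosine similarity (one possible choice of document similarity; the paper does not name the exact measure). All category names and counts below are toy data.</p>
      <preformat><![CDATA[
```python
from collections import Counter
import math

def _norm(p):
    return math.sqrt(sum(v * v for v in p.values()))

def cosine(p, q):
    n = _norm(p) * _norm(q)
    return sum(p[k] * q[k] for k in p) / n if n else 0.0

TYPE_PROFILES = {  # one Wikipedia-category profile per entity type (toy data)
    "PER": Counter({"living people": 5, "english footballers": 2}),
    "ORG": Counter({"football clubs": 4, "companies": 3}),
    "LOC": Counter({"cities": 6, "countries": 2}),
    "MISC": Counter({"films": 3, "awards": 2}),
}

def classify(entity_profile):
    # assign the type whose profile is most similar to the entity's profile
    return max(TYPE_PROFILES, key=lambda t: cosine(entity_profile, TYPE_PROFILES[t]))

print(classify(Counter({"english footballers": 3, "living people": 1})))  # PER
```
]]></preformat>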
      <p>If the extracted mention is out of vocabulary and AIDA does not assign it to an entity, we try to disambiguate its first token. If all of these methods fail to find an entity type for the mention, we simply assign the "PER" type.</p>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>
        In this section we show the experimental results of the proposed approaches on the training data. All experiments are done through 4-fold cross validation for training and testing. We use Precision, Recall and F1 as evaluation measures. Table 1 shows the NEE results along the phases of the extraction process. Twiner Seg. represents the results of the tweet segmentation algorithm described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Yago represents the results of the surface matching extraction described in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Twiner∪Yago represents the results of merging the output of the two aforementioned methods. Filter(Twiner∪Yago) represents the results after applying the filtering hypotheses. The purpose of those steps is to achieve as much recall as possible with reasonable precision. SVM is trained as described in section 2.1 to find which of the segments represent true NEs. CRF is trained and tested on tokenized tweets to extract any NE regardless of its type. CRF∪SVM is the union of the results of both CRF and SVM. Table 2 shows the final results of both extraction with CRF∪SVM and entity classification using the method presented in section 2.2 (AIDA disambiguation + entity categorization). It also shows the results of a CRF trained to both extract and classify NEs, which we consider our baseline. Our method of separating extraction and classification outperforms this baseline.
      </p>
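      <p>The evaluation measures used above can be computed per fold as sketched below; the gold and predicted annotation sets are toy examples, not the challenge data.</p>
      <preformat><![CDATA[
```python
# Precision, recall and F1 over sets of (tweet_id, mention) annotations.
def precision_recall_f1(gold, predicted):
    tp = len(gold & predicted)  # true positives: annotations found in both
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("t1", "Obama"), ("t1", "Washington"), ("t2", "Twitter")}
pred = {("t1", "Obama"), ("t2", "Twitter"), ("t2", "Monday")}
p, r, f = precision_recall_f1(gold, pred)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # 0.667 0.667 0.667
```
]]></preformat>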
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we present our approach for the IE challenge. We split the NER task into two separate tasks: NEE, which aims only to detect entity mention boundaries in text, and NEC, which assigns the extracted mention to its correct entity type. For NEE we used a hybrid approach of CRF and SVM to achieve better results. For NEC we used the AIDA disambiguation system to disambiguate the extracted named entities and hence find their types.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Flanigan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          <article-title>Part-of-speech tagging for Twitter: annotation, features, and experiments</article-title>
          .
          <source>In Proc. of the 49th ACL conference, HLT '11</source>
          , pages
          <fpage>42</fpage>
          -
          <lpage>47</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Habib</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Keulen</surname>
          </string-name>
          .
          <article-title>Unsupervised improvement of named entity extraction in short informal context using disambiguation clues</article-title>
          .
          <source>In Proc. of the Workshop on Semantic Web and Information Extraction (SWAIE)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Berberich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lewis-Kelham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Melo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>YAGO2: Exploring and querying world knowledge in time, space, context, and many languages</article-title>
          .
          <source>In Proc. of WWW 2011</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Twiner: named entity recognition in targeted twitter stream</article-title>
          .
          <source>In Proc. of the 35th ACM SIGIR conference, SIGIR '12</source>
          , pages
          <fpage>721</fpage>
          -
          <lpage>730</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</article-title>
          .
          <source>In Proc. of the 7th HLT-NAACL conference, CONLL '03</source>
          , pages
          <fpage>188</fpage>
          -
          <lpage>191</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Piecewise training of undirected models</article-title>
          .
          <source>In Proc. of UAI</source>
          , pages
          <fpage>568</fpage>
          -
          <lpage>575</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Thrasher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Viegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.-j. P.</given-names>
            <surname>Hsu</surname>
          </string-name>
          .
          <article-title>An overview of microsoft web n-gram corpus and applications</article-title>
          .
          <source>In Proc. of the NAACL HLT</source>
          <year>2010</year>
          , pages
          <fpage>45</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bordino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>AIDA: An online tool for accurate disambiguation of named entities in text and tables</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1450</fpage>
          -
          <lpage>1453</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hiemstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M. G.</given-names>
            <surname>Apers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Wombacher</surname>
          </string-name>
          .
          <article-title>Closed form maximum likelihood estimator of conditional random fields</article-title>
          .
          <source>Technical Report TR-CTIT-13-03</source>
          , University of Twente,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>