<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Part-of-Speech is (almost) enough: SAP Research &amp; Innovation at the #Microposts2014 NEEL Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Dahlmeier</string-name>
          <email>d.dahlmeier@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Naveen Nandan</string-name>
          <email>naveen.nandan@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wang Ting</string-name>
          <email>dean.wang@sap.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SAP Research and Innovation</institution>
          ,
          <addr-line>#14 CREATE, 1 Create Way</addr-line>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>4</volume>
      <abstract>
        <p>This paper describes the submission of the SAP Research &amp; Innovation team at the #Microposts2014 NEEL Challenge. We use a two-stage approach for named entity extraction and linking, based on conditional random fields and an ensemble of search APIs and rules, respectively. A surprising result of our work is that part-of-speech tags alone are almost sufficient for entity extraction. On the combined extraction and linking task, our system achieves F1 scores of 34.6% and 37.2% on a development and test split of the training set, respectively, and 37% on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Conditional Random Field</kwd>
        <kwd>Entity Extraction</kwd>
        <kwd>DBpedia Linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>In this paper, we describe the submission of the SAP
Research &amp; Innovation team. Our system breaks the task into
two separate steps for extraction and linking. We use a
conditional random field (CRF) model for entity extraction
and an ensemble of search APIs and rules for entity linking.
We describe our method and present experimental results
based on the released training data. One surprising finding
of our experiments is that part-of-speech tags alone perform
almost as well as the best feature combinations for entity
extraction.</p>
      <p>Copyright © 2014 held by author(s)/owner(s); copying permitted
only for private and academic purposes.</p>
      <p>Published as part of the #Microposts2014 Workshop proceedings,
available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141)</p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Extraction</title>
      <p>
        We use a sequence tagging approach for entity extraction. In
particular, we use a conditional random field (CRF) which
is a discriminative, probabilistic model for sequence data
with state-of-the-art performance [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A linear-chain CRF
estimates the conditional probability of a label
sequence y given the observed features x, where each label y<sub>t</sub>
is conditioned on the previous label y<sub>t−1</sub>. In our case, we
use BIO CoNLL-style tags [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We do not differentiate
between entity classes in the BIO tags (e.g., ‘B’ instead
of ‘B-PERSON’).
      </p>
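      <p>The class-less BIO scheme can be illustrated with a minimal sketch. The function name <monospace>spans_to_bio</monospace> and the representation of mentions as half-open token index ranges are assumptions made for this illustration, not part of the described system.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple


def spans_to_bio(num_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert mention spans [start, end) over token indices into class-less BIO tags."""
    tags = ["O"] * num_tokens          # every token starts outside any mention
    for start, end in spans:
        tags[start] = "B"              # first token of a mention
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the same mention
    return tags


tokens = ["Sean", "Hoare", "spoke", "about", "phone", "hacking"]
tags = spans_to_bio(len(tokens), [(0, 2), (4, 6)])
print(list(zip(tokens, tags)))
```
]]></preformat>
      <p>Because no entity class is attached to the tags, the tagger only has to choose among three labels per token.</p>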
      <p>The choice of appropriate features can have a significant
impact on the model’s performance. We have investigated a
set of features that are commonly used for named entity
extraction. Table 1 lists the features. The casing features
upper case and lower case and the is number feature are
implemented using simple regular expressions. The stripped
words feature is the lowercased word with initial hashtags
and @ characters removed. The DBpedia feature is
annotated automatically using the DBpedia Spotlight web API<sup>1</sup>
and acts as a type of gazetteer feature. For a label y<sub>t</sub> at
position t, we consider features x extracted at the current
position t and the previous position t−1. We experimented with
larger feature contexts, but they did not improve the result
on the development set.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Feature</th><th>Example</th></tr>
          </thead>
          <tbody>
            <tr><td>words</td><td>Obamah</td></tr>
            <tr><td>words lower</td><td>obamah</td></tr>
            <tr><td>POS</td><td>^</td></tr>
            <tr><td>title case</td><td>True</td></tr>
            <tr><td>upper case</td><td>False</td></tr>
            <tr><td>stripped words</td><td>obamah</td></tr>
            <tr><td>is number</td><td>False</td></tr>
            <tr><td>word cluster</td><td>-NONE-</td></tr>
            <tr><td>dbpedia</td><td>dbpedia.org/resource/Barack_Obama</td></tr>
          </tbody>
        </table>
      </table-wrap>
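      <p>The surface features described above can be sketched as follows. The text only states that simple regular expressions are used, so the concrete patterns, the dictionary keys, and the helper <monospace>with_previous</monospace> are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed patterns; the paper does not give the actual expressions.
TITLE_CASE_RE = re.compile(r"^[A-Z][a-z]+$")
UPPER_CASE_RE = re.compile(r"^[A-Z]+$")
IS_NUMBER_RE = re.compile(r"^[+-]?\d+(?:[.,]\d+)*$")
LEADING_MARKS_RE = re.compile(r"^[#@]+")


def token_features(word):
    """Compute the surface features of Table 1 for a single token."""
    lower = word.lower()
    return {
        "words": word,
        "words_lower": lower,
        "title_case": bool(TITLE_CASE_RE.match(word)),
        "upper_case": bool(UPPER_CASE_RE.match(word)),
        # lowercased word with initial hashtags and @ characters removed
        "stripped_words": LEADING_MARKS_RE.sub("", lower),
        "is_number": bool(IS_NUMBER_RE.match(word)),
    }


def with_previous(rows):
    """Pair the features at position t with those at t-1 (None at the start)."""
    return [(rows[t], rows[t - 1] if t > 0 else None) for t in range(len(rows))]


rows = [token_features(w) for w in ["#Obamah", "wins", "2012"]]
for current, previous in with_previous(rows):
    print(current["stripped_words"], current["is_number"], previous is not None)
```
]]></preformat>
      <p>The POS, word cluster, and DBpedia features come from external tools and are therefore left out of this sketch.</p>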
    </sec>
    <sec id="sec-4">
      <title>2.2 Linking</title>
      <p>For the linking step, we explore different search APIs, such
as Wikipedia search<sup>2</sup>, DBpedia Spotlight, and Google search,
to retrieve the DBpedia resource for a mention. We begin
by using the extracted entities individually as query terms
to these search APIs. As ambiguity is a major concern for
the linking task, for tweets with multiple extracted
entities, we use the combined entities as an additional
query term. For example, in a tweet with the annotated entities
Sean Hoare and phone hacking, Sean Hoare maps
to a specific resource in DBpedia, but phone hacking could
refer to more than one resource. By using the query term
“phone hacking + Sean Hoare”, we can help boost the rank
of the resource “News International phone hacking scandal”
so that the entity phone hacking maps to it instead of to a general
article on “Phone Hacking”. In our system, we make use of
the Web APIs for Wikipedia search and DBpedia Spotlight
together with some hand-written rules to rank the resources
returned. The result of the ranking step is then used to
construct the DBpedia resource URL to which the entity is
mapped.</p>
      <p><sup>1</sup> github.com/dbpedia-spotlight/dbpedia-spotlight</p>
      <p><sup>2</sup> github.com/goldsmith/Wikipedia</p>
      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Feature</th></tr>
          </thead>
          <tbody>
            <tr><td>POS</td></tr>
            <tr><td>+ is number</td></tr>
            <tr><td>+ upper case</td></tr>
          </tbody>
        </table>
      </table-wrap>
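      <p>The query construction and the final URL step can be sketched as below. The function names, the " + " joiner (mirroring the example query in the text), and the simple space-to-underscore mapping from a page title to a DBpedia resource URL are assumptions made for illustration; the calls to the search APIs and the hand-written ranking rules are not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List

DBPEDIA_PREFIX = "http://dbpedia.org/resource/"


def build_queries(mentions: List[str]) -> List[str]:
    """Return one query per mention, plus a combined query when there are several."""
    queries = list(mentions)
    if len(mentions) > 1:
        # The combined query adds context that can help disambiguate a mention.
        queries.append(" + ".join(mentions))
    return queries


def dbpedia_url(page_title: str) -> str:
    """Build a DBpedia resource URL from the title of the top-ranked page."""
    return DBPEDIA_PREFIX + page_title.strip().replace(" ", "_")


print(build_queries(["Sean Hoare", "phone hacking"]))
print(dbpedia_url("News International phone hacking scandal"))
```
]]></preformat>
      <p>Each query would then be sent to the search APIs, and the ranked results decide which page title is passed to the URL step.</p>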
    </sec>
    <sec id="sec-5">
      <title>3. EXPERIMENTS AND RESULTS</title>
      <p>In this section, we present experimental results of our method,
based on the data released by the organizers.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Data sets</title>
      <p>
        We split the provided data set into a training (first 60%),
development (dev, next 20%), and test (dev-test, last 20%)
set. As standard pre-processing, we
perform tokenization and POS tagging using the Tweet NLP
toolkit [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], look up word cluster indicators for each token
from the Brown clusters released by Turian et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and
annotate the tweets with the DBpedia Spotlight web API.
      </p>
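      <p>The positional 60/20/20 split can be sketched as follows. The function name and the use of integer arithmetic for the cut points are assumptions made for illustration; the data is split in its given order, without shuffling.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def split_60_20_20(items):
    """Split a sequence by position into train (60%), dev (20%), dev-test (20%)."""
    n = len(items)
    train_end = n * 6 // 10   # integer arithmetic avoids float rounding issues
    dev_end = n * 8 // 10
    return items[:train_end], items[train_end:dev_end], items[dev_end:]


tweets = ["tweet-%d" % i for i in range(10)]
train, dev, dev_test = split_60_20_20(tweets)
print(len(train), len(dev), len(dev_test))
```
]]></preformat>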
    </sec>
    <sec id="sec-7">
      <title>3.2 Extraction</title>
      <p>
        We train the CRF model on the training set of the data,
perform feature selection based on the dev set, and test the
resulting model on the dev-test set. We evaluate the
resulting models using precision, recall, and F1 score. In all
experiments, we use the CRF++ implementation of
conditional random fields<sup>3</sup> with default parameters. We found in
initial experiments that the CRF parameters did not have a
great effect on the final score. We employ a greedy feature
selection method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to find the best subset of features.
Table 2 shows the results of the feature selection
experiments on the development set. We can see that POS tags
alone give an F1 score of 62.2%. Adding the binary is
number feature increases the score to 62.9%. Additional features,
such as lexical features, word clusters, or the DBpedia
Spotlight annotations, do not help and even decrease the score.
Surprisingly, the word token itself is not selected as one of
the features. Thus, the CRF performs its task without even
looking at the word itself! After feature selection, we
retrain the CRF with the best-performing feature set {POS,
is number} and evaluate the model on the dev and dev-test
sets. The results are shown in Table 3.
      </p>
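      <p>A generic greedy forward selection loop is sketched below. It is not necessarily the exact procedure of the cited method. The <monospace>evaluate</monospace> callback is an assumption: in this setting it would train a CRF on the training set with the given feature subset and return the F1 score on the dev set. The toy scoring function uses synthetic numbers for illustration only and does not reproduce any result from the paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def greedy_forward_selection(candidates, evaluate):
    """Add one feature at a time, keeping the one that most improves the score.

    Stops as soon as no remaining feature improves on the current best score.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(evaluate(selected + [feature]), feature) for feature in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break                      # no candidate helps any more
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


def toy_dev_score(features):
    # Synthetic, additive scores for illustration only; not results from the paper.
    gains = {"a": 0.5, "b": 0.1, "c": -0.2}
    return sum(gains[f] for f in features)


print(greedy_forward_selection(["c", "b", "a"], toy_dev_score))
```
]]></preformat>
      <p>In the toy run, feature "c" is never selected because adding it lowers the score, which is the same mechanism by which a feature such as the word token can end up excluded.</p>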
    </sec>
    <sec id="sec-8">
      <title>3.3 Linking</title>
      <p>To test our linking system, we follow two approaches. First,
we measure the accuracy of the linking system using the
gold standard, where we observe an accuracy of 67.6%.
Second, we combine the linking step with our entity
extraction step and measure the F1 score. Table 4 shows
the results on the dev and dev-test split for the combined
system.</p>
      <table-wrap id="tab-4">
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Data set</th><th>Precision</th><th>Recall</th><th>F1 score</th></tr>
          </thead>
          <tbody>
            <tr><td>Dev</td><td>0.436</td><td>0.287</td><td>0.346</td></tr>
            <tr><td>Dev-test</td><td>0.477</td><td>0.304</td><td>0.372</td></tr>
          </tbody>
        </table>
      </table-wrap>
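      <p>The F1 score is the harmonic mean of precision and recall. As a small worked example, the sketch below recomputes it from the rounded Dev precision and recall reported above; the function name is an assumption made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Dev split: precision 0.436, recall 0.287
print(round(f1_score(0.436, 0.287), 3))  # 0.346
```
]]></preformat>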
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSION</title>
      <p>We have described the submission of the SAP Research &amp;
Innovation team to the #Microposts2014 NEEL shared task.
Our system is based on a CRF sequence tagging model for
entity extraction and an ensemble of search APIs and rules
for entity linking. Our experiments show that POS tags
are a surprisingly effective feature for entity extraction in
tweets.</p>
    </sec>
    <sec id="sec-10">
      <title>5. ACKNOWLEDGEMENT</title>
      <p>The research is partially funded by the Economic
Development Board and the National Research Foundation of
Singapore.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. E. Cano</given-names>
            <surname>Basave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stankovic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-S.</given-names>
            <surname>Dadzie</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2014) Named Entity Extraction &amp; Linking Challenge</article-title>
          .
          <source>In Proc., #Microposts2014</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.L.</given-names>
            <surname>Berger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.J.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.A.</given-names>
            <surname>Della Pietra</surname>
          </string-name>
          .
          <article-title>A maximum entropy approach to natural language processing</article-title>
          .
          <source>Computational linguistics</source>
          ,
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          .
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Owoputi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schneider</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Improved part-of-speech tagging for online conversational text with word clusters</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.T.K.</given-names>
            <surname>Sang</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>De Meulder</surname>
          </string-name>
          .
          <article-title>Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition</article-title>
          .
          <source>In Proceedings of HLT-NAACL</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Turian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Word representations: a simple and general method for semi-supervised learning</article-title>
          .
          <source>In Proceedings of ACL</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>