<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Filter-Stream Named Entity Recognition: A Case Study at the MSM2013 Concept Extraction Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego Marinho de Oliveira</string-name>
          <email>dmoliveira@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto H. F. Laender</string-name>
          <email>laender@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Veloso</string-name>
          <email>adrianov@dcc.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Altigran S. da Silva</string-name>
          <email>alti@icomp.ufam.edu.br</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Universidade Federal de Minas Gerais, Departamento de Ciência da Computação</institution>
          ,
          <addr-line>Belo Horizonte</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Universidade Federal do Amazonas, Instituto de Computação</institution>
          ,
          <addr-line>Manaus</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>1019</volume>
      <fpage>71</fpage>
      <lpage>75</lpage>
      <abstract>
        <p>Microblog platforms such as Twitter are being increasingly adopted by Web users, yielding an important source of data for web search and mining applications. Tasks such as Named Entity Recognition are at the core of many of these applications, but the effectiveness of existing tools is seriously compromised when applied to Twitter data, since messages are terse, poorly worded and posted in many different languages. In this paper, we briefly describe a novel NER approach, called FS-NER (Filter Stream Named Entity Recognition) to deal with Twitter data, and present the results of a preliminary performance evaluation conducted to assess it in the context of the Concept Extraction Challenge proposed by the 2013 Workshop on Making Sense of Microposts - MSM2013. FS-NER is characterized by the use of filters that process unlabeled Twitter messages, being much more practical than existing supervised CRF-based approaches. Such filters can be combined either in sequence or in parallel in a flexible way. Our results show that, despite the simplicity of the filters used, our approach outperformed the baseline with improvements of 4.9% on average, while being much faster.</p>
      </abstract>
      <kwd-group>
        <kwd>Twitter</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>FS-NER</kwd>
        <kwd>CRF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        In this paper, we briefly describe a novel NER approach, called FS-NER (Filter
Stream Named Entity Recognition), and present the results of a preliminary
performance evaluation conducted to assess it in the context of the Concept
Extraction Challenge proposed by the 2013 Workshop on Making Sense of
Microposts - MSM2013. Traditional approaches for Named Entity Recognition (NER)
have proved successful when applied to data obtained from
typical Web documents, but they are ill-suited to Twitter data [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], since Twitter
messages are composed of few words and usually written in informal, sometimes
cryptic style. FS-NER is an alternative NER approach better suited to deal with
Twitter data [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. In this approach, the NER process is viewed as a coarse-grained
Twitter message flow (i.e., a Twitter stream) controlled by a series of
components, referred to as filters. A filter receives a Twitter message from the
stream, performs specific processing on this message and returns information
about possible entities in it (i.e., each filter is responsible for
recognizing entities according to some specific criterion). Specifically, FS-NER employs
five lightweight filters, exploiting nouns, terms, affixes, context and dictionaries.
These filters are extremely fast and independent of grammar rules, and may be
combined in sequence (emphasizing precision) or in parallel (emphasizing recall).
      </p>
      <p>In our performance evaluation, we ran a set of experiments using micropost
data made available by the challenge organizers. Our aim in this challenge was,
given a short message (i.e., a micropost), to recognize concepts generally defined
as “abstract notions of things”. Thus, for the purpose of the challenge our task
was constrained to the extraction of entity concepts found in micropost data,
characterised by a type and a value, and considering four entity types: Person,
Organization, Location and Miscellaneous. We also employed a state-of-the-art
CRF-based baseline. Our results show that, despite the simplicity of the filters
used, our approach outperformed the baseline with improvements of 4.9% on
average, while being much faster.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Approach</title>
      <p>
        FS-NER adopts filters that allow the execution of the NER task by dividing it
into several recognition processes in a distributed way. Furthermore, FS-NER
adopts a simple yet effective probabilistic analysis to choose the most suitable
label for the terms in the message being processed. Because of this lightweight
structure, FS-NER is able to process large amounts of data in real-time. In what
follows, we briefly describe the main FS-NER aspects involved. More details can
be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p><bold>Structure and Design.</bold>
Let S = &lt; m1, m2, . . . &gt; be a stream of messages (i.e., tweets), where each mj
in S is expressed by a pair (X, Y), X being a list of terms [x1, x2, . . . , xn] that
compose mj and Y a list of labels [y1, y2, . . . , yn], such that each label yi is
associated with the corresponding term xi and assumes one of the values in the
set {Beginning, Inside, Last, Outside, UnitToken}. While X is known in advance
for all messages in S, the values for the labels in Y are unknown and must be
predicted. For example, the tweet “RT: I love Mary” could be represented by
([x1 = RT:, x2 = I, x3 = love, x4 = Mary], [y1 = Outside, y2 = Outside, y3 =
Outside, y4 = UnitToken]).</p>
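<p>The (X, Y) representation above can be illustrated with a few lines of Python. This is an illustrative sketch, not the authors' code; the helper name make_message and whitespace tokenization are assumptions:</p>

```python
# Sketch (assumed, not the authors' code): a message mj as a pair (X, Y)
# of terms and their {Beginning, Inside, Last, Outside, UnitToken} labels.

def make_message(text, labels):
    """Pair the whitespace-tokenized terms of a tweet with their labels."""
    terms = text.split()
    assert len(terms) == len(labels), "one label per term"
    return terms, labels

# The example from the text: "RT: I love Mary"
X, Y = make_message("RT: I love Mary",
                    ["Outside", "Outside", "Outside", "UnitToken"])
```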
      <p>To properly predict labels for Y , we need to provide representative data to
generate a recognition model. In FS-NER, a filter is a processing component that
estimates the probability of the labels associated with the terms of a message. A
set of features is used to support the training of the filters (such features include
information like the term itself, or if the first letter of the term is in uppercase). If
a term in X satisfies one of these features, we say that the corresponding filter is
activated by the term. Using the training set, we may count the number of times
a filter is activated by a given term, and by inspecting the corresponding label
we may calculate the likelihood of each pair {xi, yi} for each filter as expressed
by the equation</p>
      <p>P(yi = l | X ∧ F = k) = θl (1)
where F is a random variable indicating that a filter k is being used and θl is
the probability of associating the label l with the term xi. The probability θl is
given by Equation 2, where TP is the number of true positive cases and FN is
the number of false negative cases for the term xi:
θl = TP / (TP + FN) (2)</p>
      <p>Thus, once trained, a filter becomes able to recognize entities present in
the upcoming messages. It is worth noting that each filter employs a different
recognition strategy (i.e., a different feature), and thus different predictions are
possible for different filters.</p>
      <p>In sum, filters are simple abstract models that receive as input a list of terms
X and a term xi ∈ X, and provide as output a set of labels with the associated
likelihood, denoted by {l, θl}. Thus, a filter can be defined by</p>
      <p>input output
(X, xi) −−−→ F −−−−→ {l, θl}.</p>
      <p>During the recognition step, the set {l, θl} is used to choose the most likely
label for the term xi. However, if used in isolation, filters may not capture specific
patterns that can be used for recognition. Fortunately, we may exploit filter
combinations to boost recognition performance. Specifically, we may combine
filters either in sequence (i.e., if we want to prioritize recognition precision), or in
parallel (i.e., if we want to prioritize recognition recall). If combined in sequence,
all filters must be activated by the input term, and the corresponding set {l, θl}
is obtained by treating the combined filters as an atomic one using Equation 1.
In this case, it is expected that filters when combined sequentially are able to
capture more specific patterns. In contrast, if combined in parallel, the combined
filters are not considered as an atomic one. Instead, they simply represent the
average of the corresponding likelihoods, as expressed by the equation
where Z(F ) is a normalization function that receives as input a list of filters F
and produces as output the number of filters activated by term xi.</p>
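<p>The parallel combination can be sketched as follows. In this hypothetical sketch (the interface is an assumption, not the authors' API), each filter is a callable that returns a dict {label: likelihood} when activated by the term and None otherwise; Z(F) is the number of activated filters:</p>

```python
def combine_parallel(filters, X, xi):
    """Combine filters in parallel: average the label likelihoods of the
    filters activated by term xi, normalized by Z(F), the number of
    activated filters (the behaviour described for Equation 3)."""
    outputs = [f(X, xi) for f in filters]
    outputs = [o for o in outputs if o]        # keep activated filters only
    if not outputs:
        return {}
    z = len(outputs)                           # Z(F)
    combined = {}
    for o in outputs:
        for label, theta in o.items():
            combined[label] = combined.get(label, 0.0) + theta / z
    return combined
```

<p>A sequential combination would instead require every filter in the list to be activated before any likelihood is emitted, trading recall for precision as described above.</p>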
      <p>Once trained, the recognition models are used to select the most likely label
for each term in the upcoming messages.
In FS-NER, features are encapsulated by five basic filters. They are the term,
context, affix, dictionary and noun filters.</p>
      <p>The term filter estimates the probability of a certain term being an entity.
This estimation is based on the number of times a specific term has been assigned
as an entity during the training step. The context filter is especially important
since it is able to capture unknown entities. This filter analyzes only
the terms around an observed term xi, considering a window of size n, and infers
whether xi is an entity or not. The affix filter uses fragments of an observation
xi to infer whether it is an entity. Advantageously, this filter can recognize entities
whose affixes are similar to those of entities analyzed before. Thus, it makes use of
the prefix, infix or suffix of the observation to infer its label yi. The dictionary
filter uses lists of names of correlated entities to infer whether the observed term
is an entity. The dictionary is important to infer entities that do not appear in
the training data. The noun filter only considers terms that have just the first
letter capitalized to infer if the observed term is an entity.</p>
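<p>The activation condition of the noun filter, as described above, is simple enough to sketch directly (an illustrative sketch; the function name is an assumption):</p>

```python
def noun_filter_activated(term):
    """True iff the term has exactly its first letter capitalized,
    the activation criterion described for the noun filter."""
    return term[:1].isupper() and (len(term) == 1 or term[1:].islower())
```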
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>
        We performed the preliminary evaluation of our approach with the training
data made available for the MSM2013 Concept Extraction Challenge. This data
includes microposts that refer to entities of types Person (PER), Organization
(ORG), Location (LOC) and Miscellaneous (MISC). For this, we performed a
5-fold cross-validation. To reduce noise, we applied simple preprocessing techniques
like removing repeated letters and repeated adjacent terms within a micropost.
We also used additional labeled Twitter data from [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for improving recognition
results for entities of types PER and LOC. The standard filter combination
adopted for FS-NER was the generalized term filter combination that includes
all five proposed filters and presented the best performance in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the term
filter, the terms are case sensitive. The context filter uses prefix and suffix
contexts with a window of size three, which presented the best F1 result
in all collections analyzed. The affix filter uses prefix, infix and suffix sizes
of 1 to 3. The dictionary filter, specifically, uses the same lists of names of
correlated entities considered in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and others created from Wikipedia pages.
The CRF-based framework used as baseline was the one available at http://
crf.sourceforge.net, with features functionally similar to the FS-NER filters.
      </p>
      <p>Table 1 presents the obtained results. The line AVG-Diff shows the average
difference between the FS-NER and CRF-based framework results for all entity
types. These results show that, on average, FS-NER outperformed the CRF-based
framework by 4.9% for the F1 metric.</p>
      <p>Regarding the test dataset labeling, we followed the same procedure adopted
in the preliminary experiment discussed above. In addition, we trained our
approach for each entity type separately and then submitted all results together. In
case of any intersection between distinct entity types, we chose the entity type
that presented the most precise result among them (i.e., PER &gt; LOC &gt; ORG
&gt; MISC).</p>
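<p>The tie-breaking rule above can be sketched directly (the function name and data shapes are assumptions, not the authors' code):</p>

```python
# Resolve intersections between entity types by the fixed precision
# order PER > LOC > ORG > MISC described in the text.
PRIORITY = {"PER": 0, "LOC": 1, "ORG": 2, "MISC": 3}

def resolve(candidates):
    """candidates: list of (entity_type, value) predictions for one span;
    returns the prediction whose type is most precise in the fixed order."""
    return min(candidates, key=lambda c: PRIORITY[c[0]])
```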
      <p>[Table 1: Precision, Recall and F1 of FS-NER and the CRF-based framework for the entity types PER, ORG, LOC and MISC]</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        In this paper, we have briefly described a novel NER approach, called FS-NER
(Filter Stream Named Entity Recognition), and presented the results of a
performance evaluation conducted to assess it in the context of the Concept
Extraction Challenge proposed by the 2013 Workshop on Making Sense of Microposts
- MSM2013. In this challenge, our task was constrained to the extraction of
entity concepts found in micropost data, characterised by a type and a value, and
considering four entity types: Person, Organization, Location and Miscellaneous.
We also employed a state-of-the-art CRF-based baseline. Following previous
results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], our approach outperformed the baseline with improvements of 4.9% on
average, while being much faster.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially funded by InWeb - The Brazilian National Institute of
Science and Technology for the Web (grant MCT/CNPq 573871/2008-6), and
by the authors’ individual grants from CNPq, FAPEMIG and FAPEAM.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. <string-name><given-names>D. M.</given-names> <surname>de Oliveira</surname></string-name>, <string-name><given-names>A. H. F.</given-names> <surname>Laender</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Veloso</surname></string-name>, and <string-name><given-names>A. S.</given-names> <surname>da Silva</surname></string-name>. <article-title>FS-NER: A Lightweight Filter-Stream Approach to Named Entity Recognition on Twitter Data</article-title>. <source>In Proceedings of the 22nd International World Wide Web Conference (Companion Volume)</source>, pages <fpage>597</fpage>-<lpage>604</lpage>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. <string-name><given-names>K.</given-names> <surname>Gimpel</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Schneider</surname></string-name>, <string-name><given-names>B.</given-names> <surname>O'Connor</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Das</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Mills</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Eisenstein</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Heilman</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yogatama</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Flanigan</surname></string-name>, and <string-name><given-names>N. A.</given-names> <surname>Smith</surname></string-name>. <article-title>Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments</article-title>. <source>In Proceedings of the Association for Computational Linguistics (Short Papers)</source>, pages <fpage>42</fpage>-<lpage>47</lpage>, <year>2011</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. <string-name><given-names>A.</given-names> <surname>Ritter</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Clark</surname></string-name>, Mausam, and <string-name><given-names>O.</given-names> <surname>Etzioni</surname></string-name>. <article-title>Named Entity Recognition in Tweets: An Experimental Study</article-title>. <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>, pages <fpage>1524</fpage>-<lpage>1534</lpage>, <year>2011</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>