<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an Active Learning System for Company Name Disambiguation in Microblog Streams?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria-Hendrike Peetz</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Gonzalo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten de Rijke</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper we describe the collaborative participation of UvA &amp; UNED at RepLab 2013. We propose an active learning approach for the filtering subtask, using features based on the semantics detected in the tweet (using Entity Linking with Wikipedia), as well as tweet-inherent features such as hashtags and usernames. The tweets manually inspected during the active learning process amount to at most 1% of the test data. While our baseline does not perform well, we can see that active learning does improve the results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Affiliations</title>
      <p>1 ISLA, University of Amsterdam
2 UNED NLP &amp; IR Group</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>With increasing volumes of social media data, social media monitoring and
analysis have become a vital part of the marketing strategy of businesses. Manual, and
increasingly also automatic, extraction of topics, reputation, and trends around a brand
allows analysts to understand and manage a brand's reputation. Twitter, in
particular, has been used as such a proxy.</p>
      <p>Efficient manual and automatic extraction requires filtering and
disambiguation of tweets. Currently, for manual analysis, many non-relevant tweets have
to be discarded, which increases the cost of the analysis. For automatic
analysis, non-relevant tweets might distort the results and decrease reliability.</p>
      <p>Automatic, reliable named-entity disambiguation on social media is therefore an
active field of research. Typically, filtering systems are static (once trained, the
model does not change) and fully automatic (there is no interaction with the
analysts). However, both language and topics around an entity may change over
time, and the disambiguation performance is therefore likely to decay.
Additionally, assuming the improvements in performance are worth it, analysts can
easily spend the time to annotate a handful of tweets a day. We therefore propose
an active learning approach to company name disambiguation. In particular, we
analyze whether annotating a small number of tweets (at most 1% of the
test data) per company significantly improves the results.</p>
      <p>The paper is organized as follows. We continue with an introduction of the
proposed approach in Section 2. We proceed with an explanation of the runs
in the experimental setup (Section 3) and analyse the results in Section 4. We
conclude in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <p>Our proposed approach is based on active learning, a semi-automatic machine
learning process that interacts with the user to update the classification
model. It selects those instances that may maximize the classification
performance with minimal effort. Figure 1 illustrates the pipeline of the system. First,
the instances are represented as feature vectors. Second, the instances from the
training dataset are used for building the initial classification model. Third, the
test instances are automatically classified using the initial model. Fourth, the
system guesses the candidate to be prompted to the user. This step is performed
by uncertainty sampling: the instance whose classification is least certain is
selected. Fifth, the user manually inspects the instance and labels it. The
labeled instance is then used to update the model. The active learning
process is repeated until a termination condition is satisfied (e.g., the number n
of iterations performed).</p>
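      <p>The five steps above can be sketched as a generic loop. This is a minimal illustration only: the function names are hypothetical (not from the RepLab system), and a full retrain stands in for the model update.</p>
      <preformat>
```python
def active_learning_loop(train, pool, fit, predict_proba, oracle, n_iter):
    """Sketch of the pipeline: fit(labeled) returns a model,
    predict_proba(model, x) returns (P(C1|x), P(C2|x)), and oracle(x)
    is the manual annotation step. All names are illustrative."""
    labeled = list(train)                 # steps 1-2: initial model data
    model = fit(labeled)
    pool = list(pool)                     # step 3: instances to classify
    for _ in range(min(n_iter, len(pool))):
        # step 4: uncertainty sampling -- smallest gap between class probabilities
        x = min(pool, key=lambda t: abs(predict_proba(model, t)[0]
                                        - predict_proba(model, t)[1]))
        pool.remove(x)
        labeled.append((x, oracle(x)))    # step 5: manual annotation
        model = fit(labeled)              # model update (here: full retrain)
    return model
```
      </preformat>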
      <p>[Figure 1: Pipeline of the system: (1) feature representation of the training and test datasets, (2) model training/update, (3) classification, (4) candidate selection.]</p>
      <sec id="sec-3-1">
        <title>Feature representation</title>
        <p>
          Bag of Entities + Twitter metadata (BoE). First, an entity linking system is used to identify Wikipedia articles relevant to a given tweet. The
COMMONNESS probability [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], based on the intra-Wikipedia hyperlinks, is used to
select the most probable entity for each of the longest n-grams that were linked
to Wikipedia articles from corpora related to the specific language. Spanish
Wikipedia articles are finally mapped to the corresponding English Wikipedia
article by following the interlingual links, using the Wikimedia API.3 Besides
the entities linked to the tweet, special Twitter metadata (hashtags, usernames,
and the author of the tweet) is also considered as features.
        </p>
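        <p>As an illustration, the COMMONNESS of a candidate entity for an n-gram is the fraction of intra-Wikipedia links with that anchor text that point to that entity [5]. A minimal sketch (the link data in the example is made up):</p>
        <preformat>
```python
from collections import Counter

def commonness(link_targets):
    """link_targets: target articles of all intra-Wikipedia links sharing
    one anchor text (n-gram). Returns P(entity | anchor) per entity."""
    counts = Counter(link_targets)
    total = sum(counts.values())
    return {entity: n / total for entity, n in counts.items()}
```
        </preformat>
        <p>The most probable entity for the n-gram is then the argmax over this distribution.</p>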
        <p>BoE + Bag of Words (BoE+BoW). This second feature representation simply
adds the tokenized text of the tweet to the features in BoE.</p>
        <p>Features are then weighted by one of two different weighting functions:4
Presence Each term is weighted by its binary occurrence in the tweet: 1 if
present, 0 otherwise.</p>
        <p>
          Pseudo-document TF.IDF As in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], we consider a pseudo-document D
built from all the tweets given for an entity in the RepLab 2013 training/test
dataset, and a background corpus C containing all the pseudo-documents Di in the
RepLab 2013 collection. The weight w given to a term t is then
w(t, D, C) = tf(t, D) * log(N / df(t)),
where tf(t, D) denotes the term frequency of term t in pseudo-document D,
df(t) denotes the number of pseudo-documents Di in C in which the term
t occurs at least once, and N is the total number of pseudo-documents in C.</p>
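        <p>The weighting above can be written out directly. A minimal sketch, in which pseudo-documents are plain token lists (a toy stand-in for the RepLab data):</p>
        <preformat>
```python
import math

def pseudo_doc_tfidf(term, doc, corpus):
    """w(t, D, C) = tf(t, D) * log(N / df(t)): doc is the pseudo-document D
    as a token list, corpus is the background collection C of pseudo-documents,
    df(t) counts pseudo-documents containing t, and N = |C|."""
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)
```
        </preformat>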
      </sec>
      <sec id="sec-3-2">
        <title>Learning model</title>
        <p>We use Nave Bayes5 (NB) as a classi er and build an initial model. Our active
learning approach can be split into the selection of candidates for active
annotations, annotation of the candidates and updating the model. Therefore, one
iteration of our learning model follows the following three steps:
{ Select the best candidate x from the test set T ;
3 http://www.mediawiki.org/wiki/API:Properties
4 Each linked entity, hashtag, named user and author is treated as a term.
5 http://nltk.org
{ Annotate the candidate x;
{ Update the model.</p>
        <p>The annotations are selected from the test set. The test set depends on the
experimental setup: in the cross-validation scenario, we could use the available
annotations, while in the actual testing scenario, we annotated the candidates
manually.</p>
        <p>
          Candidate Selection. Following [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the candidate selection can be based on
uncertainty sampling, margin sampling (in particular for support vector
machines [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), or entropy sampling. Selecting candidates close to the margin is
motivated by selecting candidates where the classification is least confident.
Extending this motivation to NB, we look at the probabilities P(C1 | Fx)
and P(C2 | Fx) that the feature vector Fx of a candidate x generates the classes C1 and
C2. The candidate x to be annotated from the test set T is:
x = arg min_{x in T} |P(C1 | Fx) - P(C2 | Fx)|.   (1)
This candidate x is then annotated and used to update the model.</p>
        <p>Model updating. Due to the speed of the training of the model, we decided to
retrain NB with every new instance. We assigned all newly annotated instances
a higher weight than the instances in the original training set.</p>
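        <p>Equation (1) and the up-weighting of annotated instances can be sketched as follows; replicating instances is one simple way to emulate instance weights when retraining (the weight value is illustrative, not taken from the paper):</p>
        <preformat>
```python
def select_candidate(test_set, predict_proba):
    """Eq. (1): the instance minimizing |P(C1|Fx) - P(C2|Fx)|."""
    return min(test_set,
               key=lambda x: abs(predict_proba(x)[0] - predict_proba(x)[1]))

def weighted_training_data(original, annotated, weight=3):
    """Give newly annotated instances a higher weight than the original
    training instances by replicating them (weight is a hypothetical value)."""
    return list(original) + list(annotated) * weight
```
        </preformat>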
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Setup</title>
      <p>
        In the following we describe how we used the training set to select the best feature
groups. Based on this, we describe the runs we submitted. Unlike previous
company name disambiguation datasets, such as the WePS-3 ORM dataset [
        <xref ref-type="bibr" rid="ref1 ref12 ref9">1,12,9</xref>
        ]
and the RepLab 2012 dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the RepLab 2013 collection shares the same
set of entities in the training and test datasets. As reputation seems to be
entity-specific [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we build models per entity.
      </p>
      <sec id="sec-4-1">
        <title>Training and parameter selection</title>
        <p>
          Section 2.1 lists two feature representations: bag of entities (BoE) and BoE + bag
of words (BoE+BoW). The BoW representation was generated by tokenizing the
text of the tweet using a Twitter-specific tokenizer [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and removing stopwords
(using both Spanish and English stopword lists). Additionally, the feature values
could be presence or TF.IDF.
        </p>
        <p>
          We used 10-fold cross-validation (10CV) and iterative time-based splitting
(ITS) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] to evaluate the performance of the features. ITS ensures that the
classification of past tweets cannot be learnt from future tweets. Thus, we sort the
tweets according to their time stamps, train the classifier on the first K
tweets, and evaluate on the next K tweets. We then train on the first 2K tweets
and evaluate on the next K tweets, etc. The total accuracy is the mean accuracy
over all splits. We set K = 10. For both 10CV and ITS we used accuracy as our
evaluation metric.
        </p>
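        <p>The splitting scheme can be sketched as a generator over time-sorted tweets (a toy illustration; labels and the classifier are abstracted away):</p>
        <preformat>
```python
def iterative_time_splits(tweets, k=10):
    """Iterative time-based splitting: tweets must be sorted by timestamp.
    Split i trains on the first i*k tweets and tests on the next k."""
    for i in range(1, len(tweets) // k):
        yield tweets[: i * k], tweets[i * k : (i + 1) * k]
```
        </preformat>
        <p>Total accuracy is then the mean accuracy over the yielded splits.</p>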
      </sec>
      <sec id="sec-4-2">
        <title>Submitted runs</title>
        <p>The research questions that motivate our selection of submitted runs are:
RQ1 Does annotating a small number (15) of tweets from the test set improve
the results?
RQ2 Do language-dependent models perform better?
We submitted four runs based on these research questions, and two additional runs
based on our observation that the data is imbalanced. We submitted two baseline
runs without applying active learning: UvA UNED filtering 1 and UvA UNED
filtering 2. The first run is language-dependent, i.e., it uses a different NB
model per language. The second run is language-independent, i.e., it uses a
combined NB model for both languages. In order to answer RQ1, we submitted
the two active learning runs UvA UNED filtering 3 and UvA UNED filtering 4.
The initial models are based on UvA UNED filtering 1 and UvA UNED
filtering 2, respectively. For the language-dependent case, we annotated 10 tweets per
entity from the test set for English and 5 tweets per entity for Spanish. In the
language-independent case, we annotated 15 candidate tweets per entity from
the test set. The runs UvA UNED filtering 5 and UvA UNED filtering 6 are
UvA UNED filtering 3 and UvA UNED filtering 4, but when for an entity the
training set related ratio6 was &lt; 0.1 or &gt; 0.9, we used a winner-takes-all strategy.
The winner-takes-all strategy classifies all the tweets as related or unrelated,
depending on which class is dominant in the training set.</p>
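        <p>The guard used in runs 5 and 6 can be sketched as follows (a minimal illustration; the fallback active-learning classifier is abstracted as a callable, and the label strings are illustrative):</p>
        <preformat>
```python
def classify_with_guard(train_labels, test_tweets, active_learner,
                        low=0.1, high=0.9):
    """Winner-takes-all for strongly skewed entities: if the 'related' ratio
    in the training set is below `low` or above `high`, assign the majority
    class to every test tweet; otherwise defer to the active-learning run."""
    ratio = sum(1 for y in train_labels if y == "related") / len(train_labels)
    if ratio < low or ratio > high:
        majority = "related" if ratio > high else "unrelated"
        return [majority] * len(test_tweets)
    return active_learner(test_tweets)
```
        </preformat>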
        <p>
          Table 1 provides an overview of the runs. The official results are evaluated
based on accuracy, reliability (R), sensitivity (S), and F(R,S), the F1-measure of
R and S [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>In the following we analyze the results on the training set in Section 4.1. We
then elaborate on the official results in Section 4.2.</p>
      <sec id="sec-5-1">
        <title>Preliminary experiments</title>
        <p>We can, however, see some interesting improvements. For a start, active
learning helps: we can see that the use of 1% annotation improves the results for all
four metrics. Secondly, building a language-independent model performs better
than building two language-dependent models per entity. Finally, we can see that
the class imbalance also holds in the test set, as assigning the majority class for
strongly skewed data performs much better than using active learning alone.</p>
        <p>We have presented an active learning approach to company name
disambiguation in tweets. For this classification task, we found that active learning does
indeed improve the results in terms of accuracy, reliability, and sensitivity. Since
our initial models perform significantly worse than an instance-based learning
baseline (probably due to bugs in the implementation), future work will include
the analysis of the impact of active learning on stronger baselines.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>WePS-3 Evaluation Campaign: Overview of the Online Reputation Management Task</article-title>
          .
          <source>In: CLEF 2010 Labs and Workshops Notebook Papers</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
          </string-name>
          , E., de Rijke, M.: Overview of RepLab 2012:
          <article-title>Evaluating Online Reputation Management Systems</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop Notebook</source>
          Papers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A General Evaluation Measure for Document Organization Tasks</article-title>
          .
          <source>In: Proceedings SIGIR</source>
          <year>2013</year>
          (
          <article-title>Jul</article-title>
          .)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bekkerman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mccallum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <article-title>Others: Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora</article-title>
          .
          <article-title>Center for Intelligent Information Retrieval</article-title>
          ,
          <source>Technical Report IR</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Meij</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weerkamp</surname>
          </string-name>
          , W., de Rijke, M.:
          <article-title>Adding semantics to microblog posts</article-title>
          .
          <source>In: Proceedings of the fifth ACM international conference on Web search and data mining</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krieger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : TweetMotif:
          <article-title>Exploratory search and topic summarization for Twitter</article-title>
          . In: Proceedings of ICWSM, pp. 2-3
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Peetz</surname>
            ,
            <given-names>M.</given-names>
            H., de Rijke, M.
          </string-name>
          ,
          <string-name>
            <surname>Schuth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>From sentiment to reputation</article-title>
          . In: Forner,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Womser-Hacker</surname>
          </string-name>
          ,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (eds.) CLEF (Online Working Notes/Labs/Workshop) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Active learning literature survey</article-title>
          .
          <source>Computer Sciences Technical Report 1648</source>
          , University of Wisconsin-Madison (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
          </string-name>
          , E.:
          <article-title>Discovering lter keywords for company name disambiguation in Twitter</article-title>
          .
          <source>Expert Systems with Applications</source>
          <volume>40</volume>
          (
          <issue>12</issue>
          ),
          <fpage>4986</fpage>
          -
          <fpage>5003</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Spina</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
            , E., de Rijke,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oghina</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breuss</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Identifying entity aspects in microblog posts</article-title>
          .
          <source>In: SIGIR</source>
          . pp.
          <fpage>1089</fpage>
          -
          <fpage>1090</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Support vector machine active learning with applications to text classi cation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>2</volume>
          ,
          <fpage>45</fpage>
          -66 (Mar
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Tsagkias</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balog</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : The University of Amsterdam at WePS3. In:
          <article-title>CLEF 2010 Labs and Workshops Notebook Papers (</article-title>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>