<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLP-based Feature Extraction for Automated Tweet Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Stavrianou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caroline Brun</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomi Silander</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claude Roux</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Xerox Research Centre Europe</institution>
          ,
          <addr-line>Meylan</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Combination of NLP and Machine Learning Techniques. Initially we used our syntactic parser [1], which has given good results on opinion mining when applied to product reviews [2] or to the SemEval 2014 Sentiment Analysis Task [3]. However, when applied to Twitter posts, the results were not satisfactory. We therefore use a hybrid method that combines the knowledge given by our parser with learning. Linguistic information has been extracted from every annotated tweet. We have used features such as bag of words, bigrams, decomposed hashtags, negation, opinions, etc. The “liblinear” library (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) was used to classify tweets. We used a logistic regression classifier (with L2-regularization), where each class c has a separate vector of weights for all the input features.</p>
        <p>In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014. Copyright © by the paper's authors. Copying only for private and academic purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
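      <p>Traditional NLP techniques alone cannot cope with Twitter text, which often does not
follow basic syntactic rules. We show that hybrid methods can result in a more
efficient analysis of Twitter posts. Tweets regarding politicians have been annotated
with two categories: the opinion polarity and the topic (10 predefined topics). Our
contribution is the automated classification of political tweets.</p>
      <p>More formally, the logistic regression classifier models the class probability as <inline-formula><tex-math>P(c \mid x) \propto \exp\left(\sum_i w_{ci} x_i\right)</tex-math></inline-formula>, where <inline-formula><tex-math>x_i</tex-math></inline-formula> is the <inline-formula><tex-math>i</tex-math></inline-formula>th feature and <inline-formula><tex-math>w_{ci}</tex-math></inline-formula> is its weight
in class c. When learning the model, we try to find the weight vectors that
maximize the product of the class probabilities in the training data.</p>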
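      <p>The classifier described above, with one weight vector per class, can be illustrated with a minimal sketch. It assumes a softmax normalization over classes and a simple L2 penalty; the exact formulation used inside liblinear may differ, and the toy weights and feature names below are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def class_probabilities(weights, features):
    """Return P(c | x) for every class c under a softmax model (assumption).

    weights: dict mapping class label -> dict of feature name -> weight (w_ci)
    features: dict mapping feature name -> value (x_i)
    """
    scores = {
        c: sum(w.get(name, 0.0) * value for name, value in features.items())
        for c, w in weights.items()
    }
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}


def penalized_log_likelihood(weights, data, l2=1.0):
    """Log of the product of class probabilities, minus an L2 penalty."""
    log_lik = sum(math.log(class_probabilities(weights, x)[y]) for x, y in data)
    penalty = 0.5 * l2 * sum(v * v for w in weights.values() for v in w.values())
    return log_lik - penalty


# Hypothetical toy weights and one tweet's features (illustrative only).
weights = {
    "positive": {"good": 1.0, "not": -0.5},
    "negative": {"good": -1.0, "not": 0.5},
}
probs = class_probabilities(weights, {"good": 1.0})
objective = penalized_log_likelihood(weights, [({"good": 1.0}, "positive")])
```
]]></preformat>
      <p>Training then amounts to searching for the weights that maximize the penalized objective; the sketch only evaluates it and leaves the optimization to a library such as liblinear.</p>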
      <p>
        Our objective has been to identify the combination of features that yields
good prediction results while avoiding overfitting. Among the features used are
snippets and hashtags. Snippets: during annotation, we kept track of the text spans that
explained why the annotator tagged the post with a specific topic or polarity.
Hashtags: decomposition techniques have been applied to hashtags, which are then
analyzed by an opinion detection system that extracts the semantic information they carry [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
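      <p>As an illustration of the kind of features mentioned above (bag of words, bigrams, decomposed hashtags, negation), the following sketch builds a feature dictionary for one tweet. The greedy dictionary-based hashtag splitter, the toy lexicon, and the feature names are assumptions made for this example and are not the decomposition method of [4].</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical toy lexicon; a real system would use a much larger word list.
LEXICON = {"vote", "for", "change", "no", "more", "taxes"}
NEGATIONS = {"not", "no", "never"}


def decompose_hashtag(tag, lexicon=LEXICON):
    """Split a hashtag into lexicon words by greedy longest match.

    Returns the lowercased tag body unsplit when no full segmentation is found.
    """
    body = tag.lstrip("#").lower()
    words, start = [], 0
    while start < len(body):
        for end in range(len(body), start, -1):
            if body[start:end] in lexicon:
                words.append(body[start:end])
                start = end
                break
        else:
            return [body]  # no known word starts here: give up on splitting
    return words


def extract_features(tweet):
    """Return binary features: unigrams, bigrams, hashtag words, negation."""
    tokens = re.findall(r"#?\w+", tweet.lower())
    features = {}
    words = []
    for tok in tokens:
        if tok.startswith("#"):
            for part in decompose_hashtag(tok):
                features["hashtag=" + part] = 1.0
        else:
            words.append(tok)
            features["word=" + tok] = 1.0
    for first, second in zip(words, words[1:]):
        features["bigram=" + first + "_" + second] = 1.0
    if NEGATIONS.intersection(words):
        features["has_negation"] = 1.0
    return features


features = extract_features("Not happy with this #NoMoreTaxes")
```
]]></preformat>
      <p>Greedy matching is the simplest possible choice here; it can mis-segment ambiguous hashtags, which is one reason a dedicated decomposition system is preferable in practice.</p>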
      <p>We selected the models using 10-fold cross-validation on the training data
and evaluated them by their accuracy on the test data. For the topic-category task
(6,142 tweets, 80% used for training), the inter-annotator agreement was &lt;0.4,
which shows the difficulty of the task. Table 1 shows the results when
NLP features are used, as well as when some semantic merging of classes takes place.</p>
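      <p>The model selection procedure (10-fold cross-validation scored by accuracy) can be sketched as follows. The fold assignment, the random seed, and the majority-class stand-in learner are assumptions for illustration; any training function with the same interface could be plugged in.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out


def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)


def cross_validate(items, labels, train_fn, k=10):
    """Mean validation accuracy of train_fn over k folds.

    train_fn(train_items, train_labels) must return a function item -> label.
    """
    scores = []
    for train, held_out in k_fold_indices(len(items), k):
        model = train_fn([items[i] for i in train], [labels[i] for i in train])
        predicted = [model(items[i]) for i in held_out]
        scores.append(accuracy(predicted, [labels[i] for i in held_out]))
    return sum(scores) / len(scores)


def majority_baseline(train_items, train_labels):
    """Hypothetical stand-in learner: always predict the most common label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda item: most_common


# Toy data: 70 items of class "a" and 30 of class "b" (illustrative only).
labels = ["a"] * 70 + ["b"] * 30
score = cross_validate(list(range(100)), labels, majority_baseline)
```
]]></preformat>
      <p>Comparing candidate feature sets by this cross-validated score, and only then measuring accuracy once on the held-out 20%, keeps the test data out of the selection loop.</p>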
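      <table-wrap>
        <label>Table 1</label>
        <table>
          <tbody>
            <tr>
              <td>NLP features</td>
              <td>44.38</td>
              <td>29.37</td>
            </tr>
            <tr>
              <td>NLP features + merging</td>
              <td>48.91</td>
              <td>34.17</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>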
      <p>Binary classification was applied to improve the results. We selected the class with
the highest frequency and annotated the dataset with CLASS1 and NOT_CLASS1
tags. We created one model for the prediction of CLASS1, one for the prediction of CLASS2,
and one for the prediction of the remaining 8 classes. Merging these models gave
an accuracy of 40.03%, higher than the maximum accuracy in Table 1. The three models were:
CLASS1/NOT_CLASS1; CLASS2/NOT_CLASS2 (removal of CLASS1); and
the rest of the classes (removal of CLASS2).</p>
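      <p>One way to merge the three models is a simple cascade, sketched below. The text does not specify how the models were merged, so the cascade order, the function names, and the keyword-based stand-in models are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def binarize(labels, target):
    """Relabel a dataset as target vs. NOT_target for binary training."""
    return [target if lab == target else "NOT_" + target for lab in labels]


def cascade_predict(features, is_class1, is_class2, rest_model):
    """Combine two binary models and one multi-class model into one label.

    is_class1 / is_class2: functions features -> bool (binary decisions)
    rest_model: function features -> label among the remaining classes
    """
    if is_class1(features):
        return "CLASS1"
    if is_class2(features):
        return "CLASS2"
    return rest_model(features)


# Hypothetical keyword-based stand-ins for the three trained models.
def is_class1(features):
    return "word=economy" in features


def is_class2(features):
    return "word=health" in features


def rest_model(features):
    return "CLASS3"


relabelled = binarize(["CLASS1", "CLASS2", "CLASS5"], "CLASS1")
label = cascade_predict({"word=health": 1.0}, is_class1, is_class2, rest_model)
```
]]></preformat>
      <p>In such a cascade each later model only sees tweets rejected by the earlier ones, which mirrors the removal of CLASS1 and then CLASS2 from the training data.</p>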
      <p>For the opinion polarity task (5,754 tweets, 80% used for training), the
inter-annotator agreement was higher (~0.8). As Table 3 shows, we have used
NLP features not only from the tweet but also from the “snippet”. The “syntactic analysis” is
the opinion tag given by our opinion analyser. The feature sets compared are:
NLP features (syntactic analysis of opinion); and
NLP features of snippet (syntactic analysis).</p>
      <p>In conclusion, this paper provides a model that predicts opinions and topics
for a tweet in the political context. More research on feature analysis will be
carried out. We also plan to add more features produced by our syntactic analyzer, such
as POS tags or tense. We should also consider multiple-class labelling.</p>
    </sec>
    <sec id="sec-2">
      <title>Acknowledgements</title>
      <p>This work was partially funded by the project ImagiWeb (ANR-2012-CORD-0023 01).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ait-Mokhtar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanod</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Robustness beyond Shallowness: Incremental Dependency Parsing</article-title>
          .
          <source>NLE Journal</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Learning opinionated patterns for contextual opinion detection</article-title>
          .
          <source>COLING</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popa</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roux</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>XRCE: Hybrid Classification for Aspect-based Sentiment Analysis</article-title>
          .
          <source>In International Workshop on Semantic Evaluation (SemEval)</source>
          ,
          <year>2014</year>
          (to appear).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brun</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roux</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Décomposition des « hash tags » pour l'amélioration de la classification en polarité des « tweets »</article-title>
          .
          <source>In TALN, July</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>