=Paper= {{Paper |id=Vol-1202/paper11 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1202/paper11.pdf |volume=Vol-1202 }} ==None== https://ceur-ws.org/Vol-1202/paper11.pdf
      NLP-based Feature Extraction for Automated Tweet
                       Classification

               Anna Stavrianou, Caroline Brun, Tomi Silander, Claude Roux

                         Xerox Research Centre Europe, Meylan, France

                             Name.surname@xrce.xerox.com


  1      Introduction

  Traditional NLP techniques alone cannot deal with Twitter text, which often does not
  follow basic syntactic rules. We show that hybrid methods can lead to a more
  efficient analysis of Twitter posts. Tweets about politicians have been annotated
  along two dimensions: opinion polarity and topic (10 predefined topics). Our
  contribution is the automated classification of political tweets.


  2      Combination of NLP and Machine Learning Techniques

  Initially we used our syntactic parser [1], which has achieved strong results in opinion
  mining when applied to product reviews [2] and to the SemEval 2014 Sentiment Analysis
  Task [3]. However, when applied to Twitter posts, the results were not satisfactory. We
  therefore use a hybrid method that combines the knowledge provided by our parser with learning.
     Linguistic information has been extracted from every annotated tweet. We have
  used features such as bag of words, bigrams, decomposed hashtags, negation, and
  opinions. The “liblinear” library (http://www.csie.ntu.edu.tw/~cjlin/liblinear/) was
  used to classify the tweets. We used a logistic regression classifier (with L2 regularization),
  where each class c has a separate vector $w_c$ of weights for all the input features. More
  formally, $P(c \mid x) \propto \exp\left(\sum_i w_{ci} x_i\right)$, where $x_i$ is the $i$th feature and $w_{ci}$ is its weight
  in class c. When learning the model, we try to find the weight vectors $w_c$ that
  maximize the product of the class probabilities in the training data.
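The classifier described above can be sketched in a few lines. This is a minimal illustration, not the liblinear implementation: it assumes the class scores are normalized with a softmax, whereas liblinear may handle the multi-class case differently internally. The function names and the `l2` parameter are illustrative choices.

```python
import math

def class_probabilities(weights, x):
    """Return P(c | x) for every class c.

    weights: dict mapping class label -> list of per-feature weights (w_c)
    x:       list of feature values (x_i)
    Assumption: scores are normalized with a softmax.
    """
    # Linear score for each class: sum_i w_ci * x_i
    scores = {c: sum(w_i * x_i for w_i, x_i in zip(w, x))
              for c, w in weights.items()}
    # Subtract the largest score before exponentiating, for numerical stability
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def regularized_loss(weights, data, l2=1.0):
    """Negative log-likelihood of the data plus an L2 penalty on the weights.

    data: list of (x, gold_label) pairs.
    Minimizing the first term is equivalent to maximizing the product of
    the class probabilities; the penalty discourages large weights.
    """
    nll = -sum(math.log(class_probabilities(weights, x)[y]) for x, y in data)
    penalty = 0.5 * l2 * sum(w_i * w_i for w in weights.values() for w_i in w)
    return nll + penalty
```

Taking the logarithm turns the product of class probabilities into a sum, which is why the loss is written as a negative log-likelihood.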
     Our objective has been to identify the combination of features that yields
  good prediction results while avoiding overfitting. Among the features used are
  snippets (during annotation, we kept track of the text snippets that explained why the
  annotator tagged the post with a specific topic or polarity) and hashtags (decomposition
  techniques are applied to hashtags, which are then analyzed by an opinion detection system
  that extracts the semantic information they carry [4]).
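To make the feature types concrete, the following sketch builds bag-of-words, bigram, negation, and decomposed-hashtag features for a tweet. It is not the system of [4]: the hashtag splitter is a simple dictionary-based segmentation, and `VOCAB`, `NEGATIONS`, and the feature names are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical word list; a real system would rely on a full lexicon.
VOCAB = {"vote", "for", "change", "no", "more", "taxes"}
# Hypothetical negation cues.
NEGATIONS = {"not", "no", "never"}

def split_hashtag(tag, vocab=VOCAB):
    """Segment a hashtag into known words using dynamic programming.

    Returns the segmentation with the fewest words, or the whole tag
    as a single token if no full segmentation exists.
    """
    tag = tag.lstrip("#").lower()
    best = [None] * (len(tag) + 1)  # best[i] = fewest-word split of tag[:i]
    best[0] = []
    for end in range(1, len(tag) + 1):
        for start in range(end):
            word = tag[start:end]
            if best[start] is not None and word in vocab:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [tag]

def extract_features(tweet):
    """Map a tweet to a sparse dict of binary features."""
    tokens = re.findall(r"#?\w+", tweet.lower())
    words, features = [], {}
    for tok in tokens:
        if tok.startswith("#"):
            parts = split_hashtag(tok)
            for part in parts:
                features["hashtag=" + part] = 1
            words.extend(parts)  # hashtag words also feed the word features
        else:
            words.append(tok)
    for w in words:
        features["word=" + w] = 1
    for a, b in zip(words, words[1:]):
        features["bigram=" + a + "_" + b] = 1
    if any(w in NEGATIONS for w in words):
        features["has_negation"] = 1
    return features
```

For example, `split_hashtag("#VoteForChange")` yields `["vote", "for", "change"]` with the toy vocabulary above, so the words hidden inside the hashtag become visible to the classifier.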
     We selected the models using 10-fold cross-validation on the training data
  and evaluated them by their accuracy on the test data. For the topic-category task
  (6,142 tweets, 80% used for training), the inter-annotator agreement was below
  0.4, which shows the difficulty of the task. Table 1 shows the results when
  NLP features are used, as well as when some semantic merging of classes takes place.
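The evaluation protocol (an 80% training split, with 10-fold cross-validation inside the training data) can be sketched as follows. The shuffling, the seed, and the contiguous fold layout are assumptions of this sketch, and `fit` and `predict` are hypothetical caller-supplied functions; the paper does not specify these details.

```python
import random

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_indices(n, k=10):
    """Yield (train_idx, val_idx) pairs that partition range(n) into k folds."""
    # Spread the remainder so that fold sizes differ by at most one
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

def cross_val_accuracy(train_data, fit, predict, k=10):
    """Mean validation accuracy over k folds (assumes len(train_data) >= k).

    train_data:        list of (x, gold_label) pairs
    fit(examples):     returns a trained model (hypothetical)
    predict(model, x): returns a label (hypothetical)
    """
    accuracies = []
    for train_idx, val_idx in k_fold_indices(len(train_data), k):
        model = fit([train_data[i] for i in train_idx])
        correct = sum(predict(model, train_data[i][0]) == train_data[i][1]
                      for i in val_idx)
        accuracies.append(correct / len(val_idx))
    return sum(accuracies) / len(accuracies)
```

Keeping the test portion out of the cross-validation loop is what allows the prediction columns of the tables to act as an independent check on the model chosen by cross-validation.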



In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of
DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014.
Copyright © by the paper’s authors. Copying only for private and academic purposes.

      Table 1. Cross-validation and prediction accuracy (%) for topic classification.

                   Features                    Cross-validation    Prediction
                   NLP features                     44.38             29.37
                   NLP features + merging           48.91             34.17

    Binary classification was applied to improve the results. We selected the
 most frequent class and annotated the dataset with CLASS1 and NOT_CLASS1
 tags. We created one model for the prediction of CLASS1, one for the prediction of CLASS2,
 and one for the prediction of the remaining 8 classes. Merging these models gave
 an accuracy of 40.03%, higher than the best prediction accuracy in Table 1 (34.17%).
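One way to merge the three models is a cascade: test for CLASS1 first, then for CLASS2, and hand everything else to the model for the remaining classes. The paper does not spell out the merging procedure, so the ordering below is an assumption, and the three classifier arguments are hypothetical callables.

```python
def cascade_predict(x, is_class1, is_class2, rest_classifier):
    """Combine two binary models and one multi-class model into one label.

    is_class1(x), is_class2(x): return True or False (hypothetical models)
    rest_classifier(x):         returns one of the remaining labels
    Assumption: the models are applied in this fixed order.
    """
    if is_class1(x):
        return "CLASS1"
    if is_class2(x):
        return "CLASS2"
    return rest_classifier(x)

def accuracy(predicted, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```

In such a cascade an early mistake cannot be corrected later: a tweet wrongly accepted as CLASS1 never reaches the other two models, so the accuracy of the first binary model bounds what the combination can achieve.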

  Table 2. Binary classification accuracy (%) for topic: cross-validation and prediction.

  Task                                                          Cross-validation    Prediction
  CLASS1/NOT_CLASS1                                                  85.28             62.57
  CLASS2/NOT_CLASS2 (removal of CLASS1)                              92.10             68.42
  The rest of the classes (removal of CLASS2)                        49.58             38.24

    For the opinion polarity task (5,754 tweets, 80% used for training), the
 inter-annotator agreement was higher (~0.8). As Table 3 shows, we have used
 NLP features not only from the tweet but also from the ‘snippet’. The “syntactic analysis” is
 the opinion tag given by our opinion analyser.

  Table 3. Cross-validation and prediction accuracy (%) for the opinion polarities.

  Features                                                     Cross-validation    Prediction
  NLP features (syntactic analysis of opinion)                 61.28 (62.13)       56.77 (56.6)
  NLP features of snippet (syntactic analysis)                 66.41 (67.99)       61.2 (61.46)

    In conclusion, this paper provides a model that predicts opinions and topics
 for tweets in a political context. More research on feature analysis will be
 carried out. We also plan to add more features produced by our syntactic analyzer, such
 as POS tags or tense. We should also consider multiple-class labelling.


 3         Acknowledgements

    This work was partially funded by the project ImagiWeb ANR-2012-CORD-002-01.


 4         References
  1. Aït-Mokhtar, S., Chanod, J.P.: Robustness beyond Shallowness: Incremental Dependency
     Parsing. NLE Journal, 2002.
  2. Brun, C.: Learning opinionated patterns for contextual opinion detection. COLING 2012.
  3. Brun, C., Popa, D., Roux, C.: XRCE: Hybrid Classification for Aspect-based Sentiment
     Analysis. In International Workshop on Semantic Evaluation (SemEval), 2014 (to appear).
  4. Brun, C., Roux, C.: Décomposition des « hash tags » pour l’amélioration de la
     classification en polarité des « tweets ». In TALN, July, 2014.