=Paper=
{{Paper
|id=Vol-1180/CLEF2014wn-Rep-CossuEt2014
|storemode=property
|title=LIA@Replab 2014
|pdfUrl=https://ceur-ws.org/Vol-1180/CLEF2014wn-Rep-CossuEt2014.pdf
|volume=Vol-1180
|dblpUrl=https://dblp.org/rec/conf/clef/CossuKFGE14
}}
==LIA@Replab 2014==
LIA@RepLab 2014: 10 methods for 3 tasks

Jean-Valère Cossu, Kilian Janod, Emmanuel Ferreira, Julien Gaillard and Marc El-Bèze
LIA/Université d'Avignon et des Pays de Vaucluse (http://lia.univ-avignon.fr/)
39 chemin des Meinajaries, Agroparc BP 91228, 84911 Avignon cedex 9, France
firstname.name@univ-avignon.fr

Abstract. In this paper, we present the participation of the Laboratoire Informatique d'Avignon (LIA) in the RepLab 2014 edition [2]. RepLab is an evaluation campaign for Online Reputation Management systems. LIA produced a large number of experiments for every task of the campaign: Reputation Dimensions and both the Author Categorization and Author Ranking sub-tasks of Author Profiling. Our approaches rely on a large variety of machine learning methods. We chose to mainly exploit tweet contents; in several of our experiments we also added selected meta-data. A smaller number of our submissions integrated external information by using the provided background messages.

1 Introduction

RepLab addresses the challenging problem of online reputation analysis, i.e. mining and understanding opinions about companies and individuals by extracting the information conveyed in tweets. Here, the end-user application is monitoring the reputation of several entities from Twitter messages. This year the organizers defined two tasks, namely Reputation Dimensions and Author Profiling. The latter is divided into two sub-tasks, respectively Author Categorization and Author Ranking. In this context, LIA's participants proposed several methods to automatically annotate tweets for these problems. We took part in each task.

The rest of this paper is structured as follows. In Section 2, we briefly discuss the data sets and the RepLab tasks. In Section 3, we present LIA's submitted systems. Then, in Section 4, performances are reported before concluding and discussing some future work.

2 Tasks and Data

2.1 Reputation Dimensions

Data. The corpus consists of the same multilingual collection of tweets as the last edition [1], referring to a set of 61 entities spread over four domains: automotive, banking, universities and music/artists. RepLab 2014 uses only the automotive and banking subsets (31 entities). These tweets cover a period going from the 1st of June 2012 to the 31st of December 2012. Entities' canonical names were used as queries to extract tweets from a larger database. For each entity, at least 2,200 tweets were collected. The first 700 tweets were taken to compose the training set, and the remaining ones are used as the test set. Consequently, tweets concerning each of the four tasks are not homogeneously distributed in the data set. The corpus also provides additional background tweets for each entity (up to 50,000, with a large variability across entities). Each tweet is categorized into one of the following reputation dimensions: Products/Services, Innovation, Workplace, Citizenship, Governance, Leadership, Performance and Undefined.

We selected 3,000 tweets from the training collection to build a development set. As shown in Table 1, the distribution is strongly biased towards one class.

Table 1. Classes distribution in the training set.

  Label                 Number of tweets
  Citizenship           2209
  Governance            1303
  Innovation            216
  Leadership            297
  Performance           943
  Products & Services   7898
  Undefined             2228
  Workplace             468

Reputation Dimensions is a classification task that consists in categorizing tweets according to their reputation dimension.
The standard categorization provided by the Reputation Institute (http://www.reputationinstitute.com/about-reputation-institute/the-reptrak-framework) is used as gold standard. One may question the exact meaning of this task, since there is some doubt about how the reference was produced.

2.2 Author Profiling

Data. For the Author Profiling task, the data set consists of over 8,000 Twitter profiles (all with at least 1,000 followers) related to the automotive and banking domains. Each profile consists of:

– author name
– profile URL
– the last 600 tweets published by the author at crawling time

Reputation experts manually identified the opinion makers (i.e. authors with reputation influence) and annotated them as "Influencer". All profiles that are not considered opinion makers were assigned the "Non-Influencer" label. Profiles for which it was not possible to perform a classification into one of these categories were labeled "Undecidable". Each opinion maker was categorized as journalist, professional, authority, activist, investor, company, or celebrity. The data was split into training and test sets, with proportions of respectively 30% and 70%.

Author Categorization. The goal is to classify Twitter profiles by type of author: journalist, professional, authority, activist, investor, company or celebrity. The systems' output is a list of profile identifiers with the assigned categories, one per profile. Note that this sub-task was evaluated only over the profiles annotated as "Influencer" in the Author Ranking gold standard.

Author Ranking. The objective is to find out which authors have more reputation influence (who the influencers or opinion makers are) and which profiles are less influential or have no influence at all. For a given domain (e.g. automotive or banking), the system's output had to be a ranking of profiles according to their probability of being an opinion maker with respect to the concrete domain, optionally including the corresponding weights. Some aspects that determine the influence of an author on Twitter – from a reputation analysis perspective – can be the number of followers, the number of comments on a domain or the type of author.

3 Approaches

In this section we describe the LIA approaches used in this edition. Among our 10 approaches, note that some parts were also used in the last edition [4]. As some systems are combinations of several methods, our list of systems is summarized in Table 2.

3.1 Cosine distance with TF-IDF and Gini purity criteria

We proposed a supervised classification method based on a cosine distance computed over vectors built using discriminant features such as Term Frequency-Inverse Document Frequency (TF-IDF) [13], [12] combined with the Gini purity criterion [14]. This system consists of two steps. First, the text is cleaned by removing hypertext links and punctuation marks, and we generate a list of n-grams using the Gini purity criterion; during this step, stop-lists for both English and Spanish (from Oracle's website, http://docs.oracle.com) were used. In the second step, we create term (word or [2/3]-gram) models for each class by using term frequency with TF-IDF and the Gini criterion. A cosine distance measures the similarity of a given tweet by comparing its bag of words to the whole bag built for each class, and ranks tweets according to this measure. This classification process takes into account the following meta-data:

1. user id;
2. entity id / domain id.
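To make the two-step procedure above more concrete, here is a minimal Python sketch (assuming scikit-learn is available): it builds one aggregated TF-IDF profile per class, filters terms with a simplified Gini-style purity score, and classifies a tweet by cosine similarity to each profile. The purity formula, the min_purity threshold and the toy data are illustrative assumptions, not the exact configuration of the submitted runs.

```python
# Minimal sketch (not the submitted system): cosine classification over
# per-class TF-IDF profiles with a simplified Gini-style purity filter.
import numpy as np
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def gini_purity(texts, labels):
    """Purity of each term: sum_k P(class=k | term)^2 (1.0 = term seen in one class only)."""
    term_class = defaultdict(Counter)
    for text, label in zip(texts, labels):
        for tok in set(text.lower().split()):
            term_class[tok][label] += 1
    return {tok: sum((c / sum(cnt.values())) ** 2 for c in cnt.values())
            for tok, cnt in term_class.items()}

def train(texts, labels, min_purity=0.6):
    # Keep only discriminant terms (the real system also used 2/3-grams and stop-lists).
    vocab = sorted(t for t, p in gini_purity(texts, labels).items() if p >= min_purity)
    vectorizer = TfidfVectorizer(vocabulary=vocab)
    tfidf = vectorizer.fit_transform(texts)
    classes = sorted(set(labels))
    # One aggregated bag-of-words profile per class.
    profiles = np.vstack([
        np.asarray(tfidf[[i for i, l in enumerate(labels) if l == c]].mean(axis=0))
        for c in classes])
    return vectorizer, classes, profiles

def classify(tweet, vectorizer, classes, profiles):
    sims = cosine_similarity(vectorizer.transform([tweet]), profiles)[0]
    return classes[int(sims.argmax())]

# Toy usage (made-up tweets, not RepLab data):
texts = ["great new car model", "the bank fired its ceo", "we love this new car"]
labels = ["Products/Services", "Governance", "Products/Services"]
vectorizer, classes, profiles = train(texts, labels)
print(classify("what a car", vectorizer, classes, profiles))  # -> Products/Services
```

Meta-data such as the user and entity identifiers can simply be appended to the token stream before vectorization, which is one straightforward way of taking them into account.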
3.2 Hidden Markov Models

Hidden Markov models (HMM) have been widely used for categorization [15]. For each class k, a language model Lm_k is built from the training set. The language model Lm_k is made of uni-gram probabilities and of probabilities P_k(w|h), where the histories h are obtained from automatically selected chunks. Conditional probabilities are estimated from the annotated tweets of the training set, assuming that a term is considered as a unique event even if it occurs several times in a tweet (or is used several times by an author). As before, meta-data were included in the classification process.

3.3 Poisson modeling

Another approach, inspired by the method used for the fast match component of a speech recognition system [3], was also applied in parallel. Although the corpus is not that small, it is interesting to use the Poisson law since it is well suited to take into account the sparse distribution of relevant features, mainly for the under-populated classes.

3.4 Naive use of a continuous Word2vec model [8]

Word2vec is an unsupervised algorithm that gives a fixed-length vector representation for words. Word2vec models have proved their ability to extract semantic relations between words [9]: the vector of "king" is closer to the vector of "queen" than to the vector of "cat". We naively exploit this information to perform an unsupervised classification. First, two Word2vec models were built [11]. The first model was trained for English on the Brown corpus and every English tweet contained in the background corpus. The second model was trained for Spanish on various resources [7] and the Spanish tweets of the background corpus. The label "Products & Services" was split in two during classification and re-merged later. Then a naive hypothesis was made: the name of each class (citizenship, innovation, ...) represents the meaning of the class, and so the vector representation of a tweet to be classified must be somehow close to the vector representation of the class name. To achieve this, class names were manually translated from English to Spanish, and each tweet was pre-processed (tokenization, stop-word removal, etc.). Then each word is labeled with the closest class, and the majority class gives the tweet its label.
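This word-level voting can be sketched in a few lines of Python (assuming gensim >= 4 for the Word2vec model). The toy corpus, the English class-name list and the handling of out-of-vocabulary words are illustrative assumptions; the actual models were trained on the Brown corpus, the background tweets and the Spanish resources cited above.

```python
# Minimal sketch of the naive Word2vec labeling: each word votes for the
# class whose name vector is closest, and the majority vote labels the tweet.
from collections import Counter
from gensim.models import Word2Vec

# "Products & Services" is split into two names and re-merged afterwards,
# as described above.
CLASS_NAMES = ["citizenship", "governance", "innovation", "leadership",
               "performance", "products", "services", "workplace"]

# Stand-in for the tokenized training material (Brown corpus + background tweets).
sentences = [["the", "bank", "announced", "new", "services"],
             ["great", "innovation", "in", "the", "new", "car"]]
model = Word2Vec(sentences, vector_size=50, min_count=1, epochs=50)

def classify(tweet_tokens, model, class_names):
    votes = Counter()
    for tok in tweet_tokens:
        if tok not in model.wv:                          # skip out-of-vocabulary words
            continue
        sims = {c: model.wv.similarity(tok, c)           # cosine similarity between the
                for c in class_names if c in model.wv}   # word and each known class name
        if sims:
            votes[max(sims, key=sims.get)] += 1          # the word votes for its closest class
    return votes.most_common(1)[0][0] if votes else "Undefined"

print(classify(["new", "services", "announced"], model, CLASS_NAMES))
```

The multilayer perceptron of the next subsection reuses the same word vectors but learns the mapping to classes instead of relying on class-name proximity.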
3.5 Multilayer Perceptron

This classifier uses two Word2vec models, one for English and one for Spanish, and a multilayer perceptron (MLP). A multilayer perceptron is a feed-forward neural network model in which each neuron uses a nonlinear activation function; MLPs are trained with back-propagation. Our MLP used an input layer with 2500 units, one hidden layer with 200 units, an output layer with 8 units and L2 normalization. The input was the concatenation of 5 word vectors, so each tweet had to be split with a five-word sliding window. Inside the sliding window, each word is replaced by its Word2vec [8] representation. The MLP is then trained with the concatenated vector made from the sliding window as input and the tweet's label as output. During classification, the multilayer perceptron labels each window, and the final label for the entire tweet is chosen by majority rule over the different windows of the tweet.

3.6 Conditional random field [6]

CRFs represent a log-linear model, normalized at the sentence level. CRFs, though very comparable, have many advantages over hidden Markov models and maximum entropy Markov models (MEMM). HMMs model the joint probability between the observed sequences and tag sequences, while CRFs are based on the conditional probability of tags given the entire sequence. MEMMs also maximize this conditional probability, but only over local states. In our case, CRFs model the probability between classes and words as follows:

P(c_1^N \mid w_1^N) = \frac{1}{Z} \prod_{n=1}^{N} H(c_{n-1}, c_n, w_{n-2}^{n+2})    (1)

with

H(c_{n-1}, c_n, w_{n-2}^{n+2}) = \sum_{m=1}^{M} \lambda_m \cdot h_m(c_{n-1}, c_n, w_{n-2}^{n+2})    (2)

Log-linear models are based on feature functions h_m representing the information extracted from the training corpus; the weights \lambda_m are estimated during the training process, and Z is a normalization term given by:

Z = \sum_{c_1^N} \prod_{n=1}^{N} H(c_{n-1}, c_n, w_{n-2}^{n+2})    (3)

The tweets from the training set were used to train our CRF tagger with unigram (5 neighbors) and bigram features. A CRF then tagged each unigram in every tweet, and the decision for the final tweet label is made by majority vote.

Table 2. LIA's systems for RepLab 2014.

  #   Method Description
  1   HMM with TF-IDF and Gini purity criteria
  2   Cosine distance with TF-IDF and Gini purity criteria
  3   Poisson with TF-IDF and Gini purity criteria
  4   Merge of HMM and Cosine (global models)
  5   Merge of HMM, Poisson and Cosine (per-language specific models)
  6   Multilayer Perceptron
  7   Conditional random field
  8   Naive Word2vec
  9   Merge of Multilayer Perceptron, CRF, Naive and 4
  10  Merge of 4 and 5

4 Submissions and results

4.1 Systems

Ten methods compose LIA's set of submissions. For reading convenience, these methods are summed up in Table 2 and are referred to by their method number in the results tables presented below. We now compare our results with the baselines and with the best score in each task.

4.2 Reputation Dimensions

Table 3. Submitted runs to the Reputation Dimensions task, ordered by F-Score.

  #Run-ID   #Method          F-Score   Accuracy
  -         Best             0,489     0,695
  -         SVM Baseline     0,380     0,622
  Run 2     6                0,258     0,612
  Run 1     7                0,258     0,607
  Run 5     9                0,238     0,595
  Run 4     4                0,160     0,549
  -         Naive Baseline   0,152     -
  Run 3     8                0,121     0,356

As shown in Table 3, most of our runs, ranked according to F-Score, are situated between the SVM and most-frequent baselines; all our systems are under the SVM baseline. As our systems were biased towards the most frequent class, they mainly performed badly in terms of per-class F-score (computed with precision and recall), although they are not so bad in terms of accuracy. Runs 2 and 1 used separate models for the English and Spanish languages, while runs 4 and 3 used a global model. Run 1 also used the background tweets. The run based on method 6 (MLP) only used the tweets' Word2vec information; adding other sources of information should help the system make better decisions. Likewise, we could try to add more hidden layers now that we have more training data, or add an unsupervised pre-training phase. The Naive run (Run 3) did not perform well compared to the others. On one hand, its ability to infer meanings and semantic distances between words brings new information to the system; on the other hand, due to our hypothesis, this system brings a lot of noise. Word2vec models have already proved able to summarize the information contained in a document [10], and thanks to the MLP we know that there is useful information for this task in the Word2vec model. With this in mind, there are many things we want to do in order to validate or invalidate our usage of the Word2vec model. The combination (run 5) was not able to produce a good selection rule, since its performance remains lower than the best system taken alone, mostly due to the noise introduced by the Naive system.
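The contrast noted above between accuracy and per-class F-score is easy to reproduce: a classifier that assigns everything to the dominant class keeps a reasonable accuracy while its macro-averaged F-score collapses. A small illustration with scikit-learn on made-up labels (not the RepLab data):

```python
# Toy illustration (made-up labels): predicting only the majority class keeps
# a decent accuracy but a very low macro-averaged (per-class) F-score.
from sklearn.metrics import accuracy_score, f1_score

gold = ["Products/Services"] * 70 + ["Citizenship"] * 20 + ["Innovation"] * 10
pred = ["Products/Services"] * 100   # everything assigned to the dominant class

print(accuracy_score(gold, pred))                              # 0.70
print(f1_score(gold, pred, average="macro", zero_division=0))  # ~0.27
```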
Table 4. Classes distribution in the gold standard and in the systems' output.

  Label         Run 1   Run 2   Run 3   Run 4   Run 5   Gold    Baseline
  Citizenship   4578    3303    7485    855     3188    5027    3263
  Governance    1209    1226    1372    465     507     3395    2131
  Innovation    54      5       337     38      18      306     27
  Leadership    286     46      117     72      120     744     352
  Performance   916     1070    10765   266     1284    1598    668
  Prod & Svc    20713   24513   12922   29233   25696   15903   19920
  Undefined     2720    1186    6       567     383     4349    5303
  Workplace     1154    281     58      136     434     1124    241

The classes distribution (Table 4) explains the low performance level of our systems (shown in Table 3), since they are all biased towards Products & Services. As an interesting result, we can notice that the Naive run (run 3) over-estimated the Performance class.

4.3 Author Profiling

Author Categorization. Ranked according to the average accuracy (Table 5), only one of our systems is better than both the "most frequent" and the "machine learning" (SVM) baselines. One of our systems is near the SVM baseline for the "Automotive" accuracy while it outperforms the baseline's "Banking" accuracy. A second system is far behind the baselines, while the combination is worse. Run 1 used two different system combinations depending on the language: for English tweets, HMM and Poisson were combined, whereas for Spanish, Cosine was added to the above combination because there was less data. In the second run, the combined Cosine and HMM systems were trained with global models, without separating languages. Here again, our combination (run 3) was not able to produce a good selection rule, since it does worse than all systems taken alone.

Both baselines produced interesting results since they performed well; as they rank above most of the other candidates, we can consider them very strong baselines. Another interesting fact is that the "Stockholder" users were not found by any system.

With regard to the label distribution in the training set, we decided to apply a harmonization post-process to our systems' output for this task. The post-process consists, for each output, in considering the second hypothesis of the system in the following case:

– the best hypothesis is an over-populated class (the notion of over- or under-population is considered with regard to the class distribution in the training set);
– the second hypothesis is an under-populated class;
– the score differential between the two hypotheses is not significant.

In this case the system fills up small classes even though it has a better confidence in a bigger class. Although this strategy implies sacrificing some accuracy, it allows the system to do better on small classes. Depending on the chosen evaluation metric, this strategy can perform well.

Author Ranking. The run uses the same double combination of Poisson and HMM for both English and Spanish tweets as in the Author Categorization task. We interpreted this task as a binary classification problem for each author: the system considers whether each tweet in the author's bag of tweets is opinionated or not, and based on the majority label it decides whether the user is an "opinion maker" or not. To rank users, we use the probability of the "opinion maker" label over their bag of tweets; in case of parity, we add the probability of an HMM system trained with global models. As in the Author Categorization task, our Author Ranking output was post-processed in order to obtain a ratio of "opinion makers" close to that of the training set. Since there were only 2 classes in this task, our post-process can be seen as an offset and a threshold set on the probability of one class.
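A minimal Python sketch of this ranking step follows; the function names, toy profiles and placeholder tweet-level classifiers are assumptions, standing in for the Poisson+HMM combination, and the final offset/threshold post-process is not reproduced here.

```python
# Minimal sketch (not the submitted system) of the Author Ranking step:
# score each author by the fraction of their tweets labeled as opinionated,
# and rank authors by that score.
from typing import Callable, Dict, List

def author_score(tweets: List[str],
                 is_opinionated: Callable[[str], bool]) -> float:
    """Estimated probability of the 'opinion maker' label over the author's bag of tweets."""
    return sum(is_opinionated(t) for t in tweets) / len(tweets) if tweets else 0.0

def rank_authors(profiles: Dict[str, List[str]],
                 primary: Callable[[str], bool],
                 secondary: Callable[[str], bool]) -> List[str]:
    # Ties on the primary score are resolved with a second system
    # (a globally trained HMM in the paper); here it is simply used
    # as a secondary sort key.
    scores = {a: (author_score(t, primary), author_score(t, secondary))
              for a, t in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage with placeholder classifiers and made-up profiles:
primary = lambda t: "should" in t or "think" in t
secondary = lambda t: "!" in t
profiles = {"userA": ["i think banks should do better", "nice car!"],
            "userB": ["photo of my lunch", "nice car!"]}
print(rank_authors(profiles, primary, secondary))  # ['userA', 'userB']
```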
Table 5. Submitted runs to the Author Categorization task, ordered by average accuracy.

  #Run-ID   #Method        Automotive   Banking   Misc    Average   F-Score
  Run 1     5              0,445        0,502     0,461   0,473     0,319
  -         Baseline-SVM   0,426        0,494     -       0,460     0,302
  -         MF-Baseline    0,450        0,420     0,51    0,435     -
  Run 2     4              0,356        0,397     0,376   0,377     0,294
  Run 3     10             0,292        0,308     0,369   0,300     0,255

Table 6. Classes distribution in the gold standard and in the systems' output.

  Label                Run 1   Run 2   Run 3   Gold    Baseline
  Public Institution   24      36      60      90      78
  NGO                  181     190     331     233     49
  Stockholder          0       0       0       7       0
  Sportsmen            157     219     364     208     7
  Journalist           859     1407    1700    991     708
  Employee             1       2       3       14      0
  Undecidable          1972    1264    515     1412    2851
  Celebrity            39      318     347     208     0
  Professional         1492    1278    1291    1546    1144
  Company              151     165     269     222     82

Table 7. Submitted run, best run and baseline for the Author Ranking task, ordered by average MAP.

  #Run-ID    #Method   Automotive   Banking   Average MAP
  Best       -         0,721        0,410     0,565
  Run 1      5         0,502        0,450     0,476
  Baseline   -         0,370        0,385     0,378

5 Conclusions and perspectives

In this paper we presented the systems submitted by the Laboratoire Informatique d'Avignon to RepLab 2014, as well as the performances they reached. We presented a large variety of approaches and, logically, observed a large variety of system performances, even for a single system across several tasks. Our results are good in both sub-tasks of Author Profiling, but it seems that we missed something in Reputation Dimensions. We also proposed several combinations of systems in order to benefit from the diversity of information considered by our runs, but it did not work as expected; this is a sign that our results could still be improved by looking for other ways of considering the data and our systems' output during both the classification and merging processes. While the mass of data caused us many troubles, in future work we plan to automatically summarize tweet clusters or user profiles in order to reduce our representation and perform a faster classification. As we have already done on the ImagiWeb dataset [5], we intend to apply an active learning strategy to address the RepLab issue.

References

1. Amigó, E., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M. Overview of RepLab 2013: Evaluating Online Reputation Management Systems. CLEF 2013 Labs and Workshop Notebook Papers (2013).
2. Amigó, E., Carrillo-de-Albornoz, J., Chugur, I., Corujo, A., Gonzalo, J., Meij, E., de Rijke, M., and Spina, D. Overview of RepLab 2014: author profiling and reputation dimensions for Online Reputation Management. In Proceedings of the Fifth International Conference of the CLEF Initiative, September 2014, Sheffield, UK.
3. Bahl, R.L., Bakis, R., De Souza, P.V., and Mercer, R. Obtaining candidate words by polling in a large vocabulary speech recognition system. In Proceedings of ICASSP 1988 (pp. 489-492, vol. 1).
4. Cossu, J.-V., Bigot, B., Bonnefoy, L., Morchid, M., Bost, X., Senay, G., Dufour, R., Bouvier, V., Torres-Moreno, J.-M., El-Bèze, M. LIA@RepLab 2013. An evaluation campaign for Online Reputation Management Systems (CLEF'13), 23-26 September 2013.
5. Cossu, J.-V., El-Bèze, M., Sanjuan, E., and Torres-Moreno, J.-M. E-reputation monitoring on Twitter with active learning automatic annotation. Technical report hal-01002818, April 2014.
6. Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
7. Lara, L.F., Chande, R.H., and Hidalgo, M.I.G. Investigaciones lingüísticas en lexicografía, 1979, Colegio de México, Centro de Estudios Lingüísticos y Literarios 89.
8. Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
9. Mikolov, T., Yih, W.-t., and Zweig, G. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.
10. Le, Q.V., and Mikolov, T. Distributed Representations of Sentences and Documents. arXiv:1405.4053.
11. Řehůřek, R., and Sojka, P. Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 2010, pp. 45-50, ELRA, http://is.muni.cz/publication/884893/en
12. Robertson, S. Understanding inverse document frequency: on theoretical arguments for IDF. In Journal of Documentation, 60, 5, pp. 503-520, 2004, Emerald Group Publishing Limited.
13. Salton, G., and Buckley, C. Term weighting approaches in automatic text retrieval. In Information Processing and Management 24, pp. 513-523, 1988.
14. Torres-Moreno, J.-M., El-Bèze, M., Bellot, P., and Bechet. Opinion detection as a topic classification problem. In Textual Information Access, Chapter 9, pp. 337, John Wiley & Sons, 2013.
15. Wang, L., and Li, L. Automatic Text Classification Based on Hidden Markov Model and Support Vector Machine. In Proceedings of the Eighth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA), 2013 (pp. 217-224).