<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Opinion polarity detection in Twitter data combining sequence mining and topic modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Ouertatani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ghada Gasmi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiraz Latiri</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIPAH, ENSI, University of Manouba</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LIPAH, FST, University of Tunis El Manar</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LISI, INSAT, University of Carthage</institution>
          ,
          <addr-line>Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose a pipeline process to analyze opinions about festivals and cultural events by automatically detecting polarity in Twitter data. Previous studies have focused on the polarity classification of individual tweets. However, to understand the polarity of opinion in a domain, it is important to find the themes or topics that occur in the corpus. The first phase finds the optimal number of topics and identifies the major topics via the latent Dirichlet allocation (LDA) topic model. The second stage detects polarity in tweets using a sequence mining approach founded on sequences extracted from tweets with the LCM-seq algorithm [9]. The results showed that the polarity detection accuracy of the sequence mining approach was 84.78%, indicating that the proposed method is valid in most cases.</p>
      </abstract>
      <kwd-group>
        <kwd>topic modeling</kwd>
        <kwd>LDA</kwd>
        <kwd>opinion analysis</kwd>
        <kwd>sequence mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>With the advent of Web 2.0 and the evolution of social network services, users generate massive amounts of information in unstructured online reviews that cannot be directly processed by computers. Various researchers have analyzed the exchange of opinions that occurs on social network platforms.</p>
      <p>Twitter is an online social network where users post and interact with messages,
"tweets", restricted to 140 characters.</p>
<p>However, discovering sentiments and opinions through manual analysis of a large volume of textual data is extremely difficult. For that reason, specific preprocessing methods and algorithms are needed in order to mine useful patterns. Hence, in recent years, there has been much interest in the natural language processing community in developing novel text mining techniques capable of accurately extracting users' opinions from large volumes of information such as Twitter data.</p>
      <p>
        Among the various opinion mining tasks, one is polarity analysis, i.e. determining whether
the semantic orientation of a text is positive or negative. It focuses on
classifying the polarity of individual texts (e.g., web reviews or tweets) by selecting
important features through methods such as n-grams [10, 11], word subsequences
[12], information gain [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and recursive feature elimination [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. When applying
machine learning to opinion classification, most existing approaches rely on
supervised learning models trained on labeled corpora, where each document has
been labeled as positive or negative prior to training. A tweet is then classified
via algorithms such as naive Bayes, maximum entropy [11], or support vector
machines (SVM). However, sentiment classification models trained on
one domain might not work at all when moving to another domain. Furthermore,
in a more fine-grained opinion classification problem (e.g. finding users' opinions
on a particular film festival), topic detection and opinion classification are often
performed in a two-stage pipeline, by first detecting a topic and later
assigning a polarity label to that particular topic.
      </p>
<p>We propose a pipeline process to analyze opinions about festivals and cultural events by automatically detecting polarity in Twitter data. Previous studies have focused on the polarity classification of individual tweets.</p>
<p>However, to understand the polarity of opinion in a domain, it is important to find the themes or topics that occur in the corpus. Our goal in the first stage is to find the optimal number of topics and to identify the major topics via the latent Dirichlet allocation (LDA) topic model. The second stage detects polarity in tweets using a sequence mining approach founded on sequences extracted from tweets with the LCM-seq algorithm.</p>
<p>The remainder of this paper is organized as follows. Section 2 details the proposed method, which includes a data-preprocessing step; Section 3 presents the analysis results; and Section 4 presents the conclusions of this study and discusses directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed method</title>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>
          The MC2@CLEF2017 lab has released a collection of 70 000 000 microblogs over
18 months dealing with cultural events [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The microblogs are written in many languages. We
used just 5 000 000 tweets from the collection.
        </p>
<p>Simple and intuitive preprocessing techniques were applied, such as the removal of links, Twitter identifiers, punctuation, and stop words.</p>
        <p>
          Such preprocessing clearly cannot be performed without knowing the underlying language.
Therefore, modern text processing tools rely heavily on highly effective
language identification algorithms. We employed the Cavnar and Trenkle [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] approach to
text categorization, based on character n-gram frequencies, which has been
particularly successful.
        </p>
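The Cavnar-Trenkle approach ranks character n-grams by frequency and compares rank profiles with an "out-of-place" distance. As a minimal sketch (not the textcat implementation; the profile size and reference texts below are toy assumptions):

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Rank character n-grams (lengths 1..n_max) by frequency, most frequent first."""
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, n_max + 1):
        counts.update(text[i:i + n] for i in range(len(text) - n + 1))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top))}

def out_of_place(doc_profile, ref_profile):
    """Sum of rank differences; n-grams absent from the reference get a maximum penalty."""
    penalty = len(ref_profile)
    return sum(abs(rank - ref_profile.get(g, penalty))
               for g, rank in doc_profile.items())

def detect_language(text, references):
    """Pick the reference language whose profile is closest to the document's."""
    doc = ngram_profile(text)
    return min(references, key=lambda lang: out_of_place(doc, references[lang]))
```

In practice the reference profiles are trained on large monolingual corpora; a short document is then assigned to the language with the smallest out-of-place distance.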
<p>We used the implementation in the R extension package textcat, which aims at both flexibility and convenience. After the preprocessing phase, we chose the first 320 000 English tweets as our dataset. Figure 1 presents a word cloud built from our dataset. The word cloud is based on a text analysis method that highlights the most frequently used keywords (like music, film, etc.).</p>
        <p>Topic modeling is a type of statistical model in natural language processing that aims to find topics in a corpus, group topics together by looking for similarity and co-occurrence, and categorize the documents in the corpus based on the topic probabilities assigned to them.</p>
        <p>
          We specifically use a statistical method called latent Dirichlet allocation (LDA), one
of the most popular topic models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In LDA, a topic is composed of terms with generation
probabilities. For each term position in a document, LDA identifies a topic, and
each topic is composed of the terms it includes, measured
probabilistically. Given a set of documents, LDA provides an algorithm that learns the
topics and the terms associated with each topic. LDA requires one input
parameter: the number of topics to extract.
        </p>
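The paper treats LDA as a black box. To make the description above concrete, here is a minimal collapsed Gibbs sampler for LDA, an illustrative sketch and not the tool actually used by the authors:

```python
import numpy as np

def lda_gibbs(docs, n_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token-id lists.  Returns (phi, theta) where
    phi[k] is the term distribution of topic k and
    theta[d] is the topic distribution of document d."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(n_topics, size=len(d)) for d in docs]  # topic of each token
    ndk = np.zeros((len(docs), n_topics))    # document-topic counts
    nkw = np.zeros((n_topics, vocab_size))   # topic-word counts
    nk = np.zeros(n_topics)                  # tokens per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the token's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional probability of each topic for this token
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())   # resample the topic
                z[d][i] = k; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    phi = (nkw + beta) / (nk[:, None] + vocab_size * beta)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + n_topics * alpha)
    return phi, theta
```

On a corpus of hundreds of thousands of tweets one would of course use an optimized library implementation rather than this didactic loop.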
<p>The question then arises: what is the best way to determine k (the number of topics) in topic modeling?</p>
      </sec>
      <sec id="sec-2-2">
<title>Optimal number of topics for the LDA model</title>
<p>Before generating the topic model and analysing its output, we need to decide on the number of topics that the model should use. We used three metrics to estimate the best-fitting number of topics:
- Method based on the harmonic mean:</p>
<p>This method was first applied by Griffiths and Steyvers [8].
We calculated the harmonic mean of the sets of values p(w|z, k); varying k, we retain the model with the highest value.
z : per-word topic assignment.
w : word.</p>
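A numerically stable way to compute the harmonic mean of likelihood values p(w|z, k), sketched under the assumption that the sampled values are available in log space:

```python
import math

def log_harmonic_mean(log_values):
    """Harmonic mean of values supplied in log space:
    log HM = log n - logsumexp(-x), computed stably."""
    neg = [-v for v in log_values]
    m = max(neg)
    logsumexp = m + math.log(sum(math.exp(v - m) for v in neg))
    return math.log(len(log_values)) - logsumexp
```

One would evaluate `log_harmonic_mean` on the sampled log-likelihoods for each candidate k and keep the k with the highest value.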
        <p>
          k : number of topics.
- Density-based method [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
        </p>
<p>The principle is to calculate the similarity (or distance) between all pairs of themes for different models obtained by varying the number of themes.</p>
        <p>
          Themes are more independent if the similarity between them is small.
- Method based on the Kullback-Leibler divergence (KL) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
        </p>
<p>The divergence measure quantifies how the topic distribution for document m and the word distribution for a topic diverge from a second topic's expected probability distribution.</p>
<p>The optimal k is the one with the lowest divergence. All three methods require training multiple LDA models and selecting the one with the best performance, so the most efficient way is to calculate all three metrics at once. Figure 2 shows the results computed on the whole dataset. The three methods agree that somewhere between 75 and 100 topics is optimal for this dataset. To find the best value of the number-of-topics hyperparameter k, we used the perplexity measure, which assesses the applicability of a topic model to new data, with 5-fold cross-validation over the range k = [75..100]. Perplexity measures how well a probability model predicts a sample. We opted to fit a model with 85 topics; Figure 3 plots the results.</p>
        <p>Terms are assigned to topics with probabilities, so every term in the corpus is given a probability per topic. However, we can use the top terms to get a sense of what each topic covers. Figure 4 shows the topic names. For the second stage of our approach, we used the films topic.</p>
        <p>Before starting the polarity analysis phase, one must go through a subjectivity analysis stage to remove the objective tweets from our collection. To do this, we used the subjectivity lexicon 1 and N-grams as features and the</p>
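The perplexity criterion used above for selecting k is computed from per-token predictive probabilities. A minimal sketch, with hypothetical probability values:

```python
import math

def perplexity(token_probs):
    """Perplexity of held-out tokens: the exponential of the
    negative average log-likelihood.  Lower is better."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)
```

As a sanity check, a model that assigns every token the uniform probability 1/V over a vocabulary of size V has perplexity exactly V; a model that predicts every held-out token with certainty has perplexity 1.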
        <sec id="sec-2-2-1">
          <title>1 http://mpqa.cs.pitt.edu/lexicons/subj-lexicon</title>
          <p>nave bayes as classi er.</p>
          <p>For the polarity detection, we used the WordNet sentiment lexicon, TF-IDF, and the
LCM-seq algorithm 2 to extract all frequent sequences, which are used as features.
LCM-seq is an efficient algorithm for enumerating frequent sequence patterns
from a sequential database. In addition to its high speed, LCM-seq can be
applied in a variety of ways, as it can assign a positive or negative weight to each
sequence and extract only the frequent sequence patterns that appear within a specified
window width [9].</p>
          <p>For a vocabulary V, the set of finite sequences over V is denoted V*. A
sequence pattern is an arbitrary sequence s = a₁…aₙ ∈ V*, and P = V*
denotes the set of all sequence patterns over V. A sequence database over V is
a sequence set S = {s₁, …, sₘ}; we denote the size of S by |S|. For a sequence
pattern p ∈ P, a sequence of the database including p is called an occurrence of p.
The denotation of p, written Φ(p), is the set of the occurrences of p, and |Φ(p)|
is called the frequency of p, denoted Freq(p). For a given constant σ ∈ ℕ,
called the minimum support, a sequence pattern p is frequent if Freq(p) ≥ σ. In our
approach, we used a minimum support value equal to 100.</p>
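LCM-seq itself is an external C implementation (footnote 2). As a behavioral sketch only, the following naive level-wise miner enumerates frequent (possibly non-contiguous) subsequence patterns under the Freq and minimum-support definitions above; the toy database and min_sup = 2 are illustrative, whereas the paper uses min_sup = 100 on the real tweet collection:

```python
def contains(seq, pattern):
    """True if pattern occurs in seq as a (possibly non-contiguous) subsequence."""
    i = 0
    for tok in seq:
        if i < len(pattern) and tok == pattern[i]:
            i += 1
    return i == len(pattern)

def freq(db, pattern):
    """Freq(p): number of database sequences containing the pattern."""
    return sum(contains(s, pattern) for s in db)

def mine_frequent_sequences(db, min_sup, max_len=3):
    """Naive level-wise enumeration: extend each frequent pattern by one item
    at a time.  (LCM-seq achieves the same result far more efficiently via
    prefix-preserving extensions.)"""
    items = sorted({t for s in db for t in s})
    level = [(t,) for t in items if freq(db, (t,)) >= min_sup]
    frequent = list(level)
    while level and len(level[0]) < max_len:
        level = [p + (t,) for p in level for t in items
                 if freq(db, p + (t,)) >= min_sup]
        frequent.extend(level)
    return frequent
```

Each frequent pattern returned (e.g. a word pair that co-occurs in order across many tweets) then serves as one binary feature for the downstream classifier.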
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <sec id="sec-3-1">
        <title>Experimental validation</title>
<p>For the subjectivity analysis phase, we used as training data the corpus introduced by Pang and Lee (ACL 2004) 3, with the subjectivity lexicon and N-grams as features.</p>
<p>For the polarity detection, we used the sentiment140 data as training data 4, and we used the frequent sequences as features for the naive Bayes classifier.</p>
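As a sketch of this training setup, here is a self-contained multinomial naive Bayes classifier over plain token features. Note the simplifications: the paper's actual features are frequent sequences and TF-IDF weights, and the tiny labeled tweets below are hypothetical, not drawn from sentiment140:

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Multinomial naive Bayes with Laplace smoothing.
    docs: list of token lists; labels: parallel list of class labels."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
        vocab.update(doc)
    v = len(vocab)
    loglik = {c: {w: math.log((counts[c][w] + 1) / (sum(counts[c].values()) + v))
                  for w in vocab} for c in classes}
    # smoothed log-probability for words never seen with a class
    unseen = {c: math.log(1 / (sum(counts[c].values()) + v)) for c in classes}
    return prior, loglik, unseen

def classify(model, doc):
    """Return the class maximizing log prior + sum of word log-likelihoods."""
    prior, loglik, unseen = model
    scores = {c: prior[c] + sum(loglik[c].get(w, unseen[c]) for w in doc)
              for c in prior}
    return max(scores, key=scores.get)
```

Swapping the token features for frequent-sequence indicators leaves the classifier unchanged: each mined sequence simply becomes one "word" in the feature vocabulary.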
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation protocol</title>
<p>As evaluation metric, we used the classifier accuracy.</p>
        <p>The accuracy can be defined as the percentage of correctly classified instances:</p>
        <p>Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)</p>
        <p>where TP, FN, FP and TN represent the number of true positives, false negatives, false positives and true negatives, respectively.</p>
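Equation (1) in code, with hypothetical confusion-matrix counts:

```python
def accuracy(tp, tn, fp, fn):
    """Equation (1): share of correctly classified instances."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, 50 true positives, 40 true negatives and 5 errors of each kind yield an accuracy of 0.9.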
<p>The following table illustrates the results for the naive Bayes classifier:</p>
        <sec id="sec-3-2-1">
<title>2 http://research.nii.ac.jp/uno/code/LCM-seq.html 3 http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html 4 http://help.sentiment140.com/for-students</title>
          <p>The polarity detection aims to automatically classify customer opinions and to
provide a comprehensive understanding of customer feedback from raw data on the
Web. Of all the social network platforms, Twitter has been one of the most
popular sources for marketing research and sentiment classification.
The work described in this paper is a step towards efficient classification of tweets
using topic modeling.
Dublin, Ireland, 11/09/2017-14/09/2017, volume 10456 of Lecture Notes in
Computer Science. Springer, 2017.
8. T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the</p>
          <p>National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
9. T. Nakahara, T. Uno, and K. Yada. Extracting promising sequential patterns from
RFID data using the LCM sequence. In Knowledge-Based and Intelligent Information
and Engineering Systems. Springer, 2010.
10. A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion
mining. In LREC, volume 10, 2010.
11. B. Pang, L. Lee, and S. Vaithyanathan. Thumbs up? Sentiment classification using
machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing, pages 79-86. Association for
Computational Linguistics, 2002.
12. R. Xia, C. Zong, and S. Li. Ensemble of feature sets and classification algorithms
for sentiment classification. Information Sciences, 181(6):1138-1152, 2011.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thoms</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Fu</surname>
          </string-name>
          .
<article-title>Affect analysis of web forums and blogs using correlation ensembles</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>20</volume>
          (
          <issue>9</issue>
          ):
          <volume>1168</volume>
-
          <fpage>1180</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Arun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Veni Madhavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. Narasimha</given-names>
            <surname>Murthy</surname>
          </string-name>
          .
<article-title>On finding the natural number of topics with latent dirichlet allocation: Some observations</article-title>
          .
          <source>Advances in Knowledge Discovery and Data Mining</source>
          , pages
          <volume>391</volume>
-
          <fpage>402</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of machine Learning research</source>
          ,
          <volume>3</volume>
          (Jan):
          <volume>993</volume>
-
          <fpage>1022</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Tang</surname>
          </string-name>
          .
          <article-title>A density-based method for adaptive lda model selection</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>72</volume>
          (
          <issue>7</issue>
          ):
          <volume>1775</volume>
-
          <fpage>1781</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Cavnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Trenkle</surname>
          </string-name>
          , et al.
          <article-title>N-gram-based text categorization</article-title>
          . Ann Arbor MI,
          <volume>48113</volume>
          (
          <issue>2</issue>
          ):
          <volume>161</volume>
-
          <fpage>175</fpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beaupre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Pouliot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wakim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leclerc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          , et al.
<article-title>Bulk heterojunction solar cells using thieno [3, 4-c] pyrrole-4, 6-dione and dithieno [3, 2-b: 2, 3-d] silole copolymer with a power conversion efficiency of 7.3%</article-title>
          .
          <source>Journal of the American Chemical Society</source>
          ,
          <volume>133</volume>
          (
          <issue>12</issue>
          ):
          <volume>4250</volume>
-
          <fpage>4253</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Goeuriot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mulhem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Sanjuan</surname>
          </string-name>
          .
          <article-title>CLEF 2017 Microblog Cultural Contextualization Lab Overview (regular paper)</article-title>
          .
          In
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, CLEF</source>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>