<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approach for Sentiment Lexicon Generation using Word Skipgrams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javi Fernández</string-name>
          <email>javifm@ua.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Natural Language Processing</institution>
          ,
          <addr-line>NLP</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Alicante</institution>
          ,
          <addr-line>Carretera San Vicente del Raspeig S/N, 03690 San Vicente del Raspeig, Alicante</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>This Ph.D. thesis work proposes the design, development and evaluation of a supervised approach for sentiment lexicon generation. It is based on the hypothesis that an efficient use of skipgram modelling can improve sentiment analysis tasks and reduce the resources needed while maintaining an acceptable level of quality. In summary, the novelty of this approach lies in the use of skipgrams as information units and the way they are efficiently generated, weighted and filtered, taking advantage of the useful information they provide about the sequentiality of the language.</p>
      </abstract>
      <kwd-group>
        <kwd>Word Skipgrams</kwd>
        <kwd>skipgrams</kwd>
        <kwd>skipgram modelling</kwd>
        <kwd>lexicon generation</kwd>
        <kwd>sentiment lexicon</kwd>
        <kwd>sentiment analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This Ph.D. thesis proposes the design, development and evaluation of a sentiment lexicon
generation approach. Section 2 reviews related work in this area, Section 3 describes the
proposed approach, Section 4 explains the methodology followed to carry out this work, and
Section 5 presents the specific research issues to be discussed.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <p>
        Sentiment Analysis is the task that deals with the computational treatment of opinion, sentiment,
and subjectivity in text [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This field has several subtasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], such as Aspect-based Sentiment
Analysis, Subjectivity Detection, Emotion Detection or Polarity Classification, but in this work we
focus on the latter. Polarity Classification is the task of classifying an
opinionated document as expressing a positive or negative opinion [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The approaches that
can be followed in this context are usually divided into two main groups [
        <xref ref-type="bibr" rid="ref1 ref10 ref9">9, 10, 1</xref>
        ]: lexicon-based
approaches and machine-learning-based approaches.
      </p>
      <p>
        In lexicon-based approaches, the polarity of a document is calculated from the semantic
orientation of its words or phrases [11]. These techniques mainly focus on using or building
dictionaries of sentiment words. Dictionaries can be created manually [12] or automatically
[11]. Examples of general and publicly available sentiment dictionaries include WordNet Affect
[13], SentiWordNet [14] or ML-SentiCon [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, it is difficult to compile and maintain a
universal lexicon, as the same word in different domains can express different opinions [11, 15].
      </p>
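      <p>As a minimal illustration of the lexicon-based idea, the following Python sketch sums the orientation of the known words in a document. The function name, the toy lexicon and its scores are hypothetical and are not taken from any of the dictionaries cited above.</p>
      <preformat>
```python
def polarity(tokens, lexicon):
    """Classify a tokenised document by summing the orientation of its known words."""
    total = sum(lexicon.get(token, 0.0) for token in tokens)
    if total > 0:
        return "positive"
    if total == 0:
        return "neutral"
    return "negative"


# Hypothetical toy lexicon: word -> semantic orientation.
toy_lexicon = {"good": 1.0, "great": 1.0, "bad": -1.0, "boring": -1.0}

label_a = polarity("a great film".split(), toy_lexicon)
label_b = polarity("the plot was not good".split(), toy_lexicon)
```
      </preformat>
      <p>In this sketch the second document is labelled positive, because a word-level lookup sees "good" but has no entry for the phrase "not good".</p>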
      <p>
        The second approach uses machine learning techniques. These techniques require the use
of a polarity labelled corpus to create a classifier capable of classifying the polarity of new
documents. Most of the existing work employs Support Vector Machines [16, 17, 18] or Naïve
Bayes [19, 17, 20], but recent work makes use of Deep Learning [
        <xref ref-type="bibr" rid="ref4">4, 21</xref>
        ]. In this approach, texts are
represented as feature vectors, and a good selection of these features is what mainly improves
the performance. These approaches perform very well in the domain in which they have been
trained but perform worse when used in a different domain [
        <xref ref-type="bibr" rid="ref6">6, 22</xref>
        ].
      </p>
      <p>Traditionally, these approaches do not take into account the sequentiality of the
words contained in the text, so they lose some information during the process. Some techniques
can help to solve this problem, such as Transformers or RNNs [23, 24]. Skipgram modelling
has also shown good results if used efficiently [25, 26, 27], and it is the basis of our
research.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the Proposed Research</title>
      <p>The novelty of our research lies in the use of the skipgram modelling technique [28] to
generate, weight and filter multi-word terms in order to build a sentiment lexicon, which will be used
as the basis for an automatic polarity classifier. The hypothesis of this work is that an efficient
use of the skipgram modelling technique can not only improve sentiment analysis tasks but
also reduce the resources needed while maintaining an acceptable level of quality.</p>
      <p>The skipgram modelling technique consists of obtaining multi-word terms from a text, similar
to n-grams but allowing some words to be skipped. More specifically, in a k-skip-n-gram, n
determines the number of words, and k the maximum number of words that can be skipped. In
this way we are generating additional terms that retain some of the sequentiality of the original
words, but in a more flexible way than n-grams. It is worth noting that n-grams can be defined
as skipgrams where k = 0 (no skips). The skipgram modelling technique is not new in the field
of NLP. There are many approaches that use skipgram modelling to relate words to each other,
but most of them still use words as the basic unit of information [29, 30].</p>
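      <p>The definition above can be sketched in a few lines of Python. This is an illustration only, not the implementation used in this work: the function name is hypothetical, and the sketch assumes that k bounds the total number of words skipped inside one term.</p>
      <preformat>
```python
from itertools import combinations


def skipgrams(tokens, n, k):
    """Return every k-skip-n-gram of `tokens` as a (term, skips) pair.

    Assumption: k bounds the total number of words skipped inside one term.
    """
    results = []
    for start in range(len(tokens) - n + 1):
        # A term of n words with at most k skips spans at most n + k positions.
        window_end = min(start + n + k, len(tokens))
        for rest in combinations(range(start + 1, window_end), n - 1):
            positions = (start,) + rest
            skips = positions[-1] - positions[0] + 1 - n
            term = tuple(tokens[i] for i in positions)
            results.append((term, skips))
    return results


tokens = "the movie was not good".split()
bigrams = skipgrams(tokens, n=2, k=0)    # plain n-grams: no skips allowed
one_skip = skipgrams(tokens, n=2, k=1)   # also pairs that are one word apart
```
      </preformat>
      <p>With k = 0 the sketch returns the four ordinary bigrams of the sentence; with k = 1 it also returns terms such as ("was", "good"), which skips the word "not".</p>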
      <p>The main disadvantage of this technique lies in the fact that the number of skipgrams
generated is usually very large. To mitigate this problem, a scoring and filtering process
becomes necessary. In this work, scoring and filtering take into account the following
factors: (i) the number of times the term appears in the corpus; (ii) the number of times the
term appears in the corpus for each polarity; (iii) the number of words that the term contains;
and (iv) the (average) number of skips required to obtain that term.</p>
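      <p>The four factors can be gathered in a single pass over the term occurrences, as in the Python sketch below. The data layout and the function names are assumptions made for illustration, and the weight formula in particular is hypothetical: the text does not give the actual scoring formula, so the one shown here only demonstrates how such factors could be combined.</p>
      <preformat>
```python
from collections import Counter


def collect_factors(occurrences):
    """Aggregate the four factors for every term.

    `occurrences` holds (term, polarity, skips) triples, one per time a term
    is found in a labelled document; `term` is a tuple of words.
    """
    factors = {}
    for term, polarity, skips in occurrences:
        entry = factors.setdefault(
            term,
            {"count": 0, "by_polarity": Counter(), "words": len(term), "skip_sum": 0},
        )
        entry["count"] += 1                   # (i) occurrences in the corpus
        entry["by_polarity"][polarity] += 1   # (ii) occurrences per polarity
        entry["skip_sum"] += skips            # "words" above is factor (iii)
    for entry in factors.values():
        entry["avg_skips"] = entry["skip_sum"] / entry["count"]   # (iv)
    return factors


def weight(entry, polarity):
    """Hypothetical weight: share of the polarity, damped by the average skips."""
    share = entry["by_polarity"][polarity] / entry["count"]
    return share / (1.0 + entry["avg_skips"])


occurrences = [
    (("not", "good"), "negative", 0),
    (("not", "good"), "negative", 1),
    (("not", "good"), "positive", 0),
    (("good",), "positive", 0),
]
factors = collect_factors(occurrences)
negative_weight = weight(factors[("not", "good")], "negative")
```
      </preformat>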
      <p>Furthermore, the weighting and filtering process is not carried out after the generation of
all terms, but is done progressively at build time. In a first phase, single-word terms (n = 1)
are obtained, which are weighted and filtered. From these filtered terms (and only from these),
two-word terms (n = 2) are obtained, which are also weighted and filtered. The iterative process
continues until the desired maximum number of words per term is reached. In this way we
manage to create multi-word terms in a much more efficient way than generating all the
skipgrams and filtering them in a last step.</p>
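      <p>The progressive strategy can be sketched as a level-by-level loop. The Python code below is an illustration under stated assumptions, not the implementation used in this work: it assumes that an n-word term is only generated when its (n-1)-word prefix survived the previous filter, it assumes that k bounds the total number of skips in a term, and it replaces the weighting step with a simple minimum-count filter. All names and parameters are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def candidates(tokens, n, k, kept):
    """Yield n-word skipgrams (at most k skips in total) whose prefixes survived."""
    def grow(positions):
        term = tuple(tokens[i] for i in positions)
        if len(positions) == n:
            yield term
            return
        if term not in kept:   # prune: this prefix was filtered out earlier
            return
        used = positions[-1] - positions[0] + 1 - len(positions)   # skips so far
        last = positions[-1]
        for nxt in range(last + 1, min(last + 2 + k - used, len(tokens))):
            yield from grow(positions + (nxt,))

    for start in range(len(tokens)):
        yield from grow((start,))


def build_terms(corpus, max_words, k, min_count):
    """Build terms one level at a time, filtering before moving to the next level."""
    kept = set()
    for n in range(1, max_words + 1):
        counts = Counter(
            term for tokens in corpus for term in candidates(tokens, n, k, kept)
        )
        # Stand-in filter: keep terms seen at least `min_count` times.
        kept |= {term for term, count in counts.items() if count >= min_count}
    return kept


corpus = [["not", "very", "good"], ["not", "very", "good"], ["really", "bad"]]
terms = build_terms(corpus, max_words=2, k=1, min_count=2)
```
      </preformat>
      <p>In this example the words "really" and "bad" are discarded at the first level, so the term ("really", "bad") is never generated or counted at the second level.</p>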
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Most of the work on this research has already been done. The approach has been developed,
evaluated and compared to some of the existing techniques, carrying out the appropriate
experiments in different contexts and on different datasets, and multiple articles have been published
confirming its effectiveness [31, 27, 32, 25, 26, 33]. We can highlight the latest work, where we
have used one of these tools to successfully extract sentiment analysis patterns that determine
the virality of tweets about the COVID-19 pandemic [33]. In addition, multiple tools that
use this approach (or a previous version) have been developed in the context of this thesis
[34, 35, 36, 37]. Some of these tools have been commercialised and successfully used by many
clients and businesses.</p>
      <p>However, there is still some work to be done. Since this thesis has extended over time,
it is necessary to update the state of the art with the latest similar techniques. In addition, we
believe it would be useful to carry out an exhaustive study comparing the use of skipgrams
with single words or n-grams in different domains, contexts, textual genres, languages and
learning techniques. Moreover, we plan to study in detail which words work best
when creating terms using skipgram modelling, considering features such as their part of speech,
meaning or role in the sentence. Finally, we would like to integrate the use of skipgrams into current
techniques such as Word Embeddings or Transformers to see if further improvements can be
achieved.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Specific Issues of Research to be Discussed</title>
      <p>The main questions we want to answer with this thesis are the following:</p>
      <list list-type="bullet">
        <list-item>
          <p>Does using skipgrams instead of single words or n-grams bring any improvement?</p>
        </list-item>
        <list-item>
          <p>Is it possible to reduce the number of multi-word terms generated by skipgram modelling
to improve speed and resource requirements?</p>
        </list-item>
        <list-item>
          <p>Does the effectiveness of skipgram modelling depend on the domain, context,
textual genre, language or learning technique in which it is applied?</p>
        </list-item>
        <list-item>
          <p>What kind of words work best when using skipgram modelling to generate multi-word terms?</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research work has been supported by TRIVIAL (PID2021-122263OB-C22), funded by
MCIN/AEI/10.13039/501100011033, by the “European Union Regional Development Fund (ERDF)
A way of making Europe” and by the “European Union NextGenerationEU/PRTR”.</p>
      <p>
[11] P. D. Turney, Thumbs up or thumbs down? semantic orientation applied to unsupervised
classification of reviews, arXiv preprint cs/0212032 (2002).
[12] P. J. Stone, D. C. Dunphy, M. S. Smith, The general inquirer: A computer approach to
content analysis. (1966).
[13] C. Strapparava, A. Valitutti, et al., Wordnet affect: an affective extension of wordnet, in:
LREC, volume 4, Lisbon, 2004, p. 40.
[14] F. Sebastiani, A. Esuli, Sentiwordnet: A publicly available lexical resource for opinion
mining, in: Proceedings of the 5th International Conference on Language Resources and
Evaluation, 2006, pp. 417–422.
[15] G. Qiu, B. Liu, J. Bu, C. Chen, Expanding domain sentiment lexicon through double
propagation, in: Twenty-First International Joint Conference on Artificial Intelligence,
2009.
[16] T. Mullen, N. Collier, Sentiment analysis using support vector machines with diverse
information sources, in: Proceedings of the 2004 conference on empirical methods in
natural language processing, 2004, pp. 412–418.
[17] R. Prabowo, M. Thelwall, Sentiment analysis: A combined approach, Journal of
Informetrics 3 (2009) 143–157.
[18] T. Wilson, P. Hoffmann, S. Somasundaran, J. Kessler, J. Wiebe, Y. Choi, C. Cardie, E. Riloff,
S. Patwardhan, Opinionfinder: A system for subjectivity analysis, in: Proceedings of
HLT/EMNLP 2005 Interactive Demonstrations, 2005, pp. 34–35.
[19] B. Pang, L. Lee, A sentimental education: Sentiment analysis using subjectivity
summarization based on minimum cuts, arXiv preprint cs/0409058 (2004).
[20] J. Wiebe, T. Wilson, C. Cardie, Annotating expressions of opinions and emotions in
language, Language resources and evaluation 39 (2005) 165–210.
[21] Q. T. Ain, M. Ali, A. Riaz, A. Noureen, M. Kamran, B. Hayat, A. Rehman, Sentiment analysis
using deep learning techniques: a review, Int J Adv Comput Sci Appl 8 (2017) 424.
[22] S. Tan, X. Cheng, Y. Wang, H. Xu, Adapting naive bayes to domain adaptation for sentiment
analysis, in: European Conference on Information Retrieval, Springer, 2009, pp. 337–349.
[23] X. Wang, W. Jiang, Z. Luo, Combination of convolutional and recurrent neural network for
sentiment analysis of short texts, in: Proceedings of COLING 2016, the 26th international
conference on computational linguistics: Technical papers, 2016, pp. 2428–2437.
[24] M. Munikar, S. Shakya, A. Shrestha, Fine-grained sentiment classification using bert, in:
2019 Artificial Intelligence for Transforming Business and Society (AITB), volume 1, IEEE,
2019, pp. 1–5.
[25] Y. Gutierrez, D. Tomas, J. Fernandez, Benefits of using ranking skip-gram techniques for
opinion mining approaches, in: eChallenges e-2015 Conference, IEEE, 2015, pp. 1–10.
[26] E. Martínez-Cámara, Y. Gutiérrez-Vázquez, J. Fernández, A. Montejo-Ráez, R.
Muñoz-Guillena, Ensemble classifier for twitter sentiment analysis, NLP Applications: completing
the puzzle (2015) 1–12.
[27] J. Fernández, Y. Gutiérrez, J. M. Gómez, P. Martinez-Barco, Gplsi: Supervised sentiment
analysis in twitter using skipgrams, in: Proceedings of the 8th International Workshop on
Semantic Evaluation (SemEval 2014), 2014, pp. 294–299.
[28] D. Guthrie, B. Allison, W. Liu, L. Guthrie, Y. Wilks, A closer look at skip-gram modelling.,
in: LREC, volume 6, Citeseer, 2006, pp. 1222–1225.
[29] K. W. Church, Word2vec, Natural Language Engineering 23 (2017) 155–162.
[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I.
Polosukhin, Attention is all you need, Advances in neural information processing systems 30
(2017).
[31] J. Fernández Martínez, Y. G. Vázquez, J. M. Gómez Soriano, P. Martínez Barco, A. M.
Guijarro, R. M. Guillena, Sentiment analysis of spanish tweets using a ranking algorithm
and skipgrams, in: XXIX Congreso de la Sociedad Española de Procesamiento de Lenguaje
Natural: SEPLN 2013, Sociedad Española para el Procesamiento del Lenguaje Natural, 2013,
pp. 133–142.
[32] J. Fernández, J. M. Gómez, P. Martínez-Barco, A supervised approach for sentiment analysis
using skipgrams, in: Proceedings of the Workshop on Natural Language Processing in the
5th Information Systems Research Working Days (JISIC), 2014, pp. 30–36.
[33] E. Saquete, J. Zubcoff, Y. Gutiérrez, P. Martínez-Barco, J. Fernández, Why are some
social-media contents more popular than others? opinion and association rules mining applied
to virality patterns discovery, Expert Systems with Applications 197 (2022) 116676.
[34] J. Fernández, Y. Gutiérrez, J. M. Gómez, P. Martínez-Barco, Social rankings: análisis visual
de sentimientos en redes sociales, Procesamiento del Lenguaje Natural (2015) 199–202.
[35] J. Fernández, F. Llopis, Y. Gutiérrez, P. Martínez-Barco, Á. Díez, Opinion mining in social
networks versus electoral polls., in: RANLP, 2017, pp. 231–237.
[36] J. Fernández, F. Llopis, P. Martínez-Barco, Y. Gutiérrez, Á. Díez, Analizando opiniones en
las redes sociales, Procesamiento del Lenguaje Natural 58 (2017) 141–148.
[37] I. Moreno, J. Fernández Martínez, Y. Gutiérrez, et al., Social-univ 2.0: Tecnologías del
lenguaje humano, aplicación para la monitorización omnicanal del entorno social de la
universidad de alicante (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Taboada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brooke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tofiloski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Voll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stede</surname>
          </string-name>
          ,
          <article-title>Lexicon-based methods for sentiment analysis</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>37</volume>
          (
          <year>2011</year>
          )
          <fpage>267</fpage>
          -
          <lpage>307</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Baccianella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining</article-title>
          .,
          <source>in: Lrec</source>
          , volume
          <volume>10</volume>
          ,
          <year>2010</year>
          , pp.
          <fpage>2200</fpage>
          -
          <lpage>2204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F. L.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Troyano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <article-title>Ml-senticon: Un lexicón multilingüe de polaridades semánticas a nivel de lemas</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>53</volume>
          (
          <year>2014</year>
          )
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Deep learning for sentiment analysis: A survey</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <elocation-id>e1253</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Sentiment embeddings with applications to sentiment analysis</article-title>
          ,
          <source>IEEE transactions on knowledge and data Engineering</source>
          <volume>28</volume>
          (
          <year>2015</year>
          )
          <fpage>496</fpage>
          -
          <lpage>509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Opinion mining and sentiment analysis</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>35</volume>
          (
          <year>2009</year>
          )
          <fpage>311</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>A survey of sentiment analysis in social media</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>60</volume>
          (
          <year>2019</year>
          )
          <fpage>617</fpage>
          -
          <lpage>663</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Ráez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Martínez-Cámara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <article-title>A knowledge-based approach for polarity classification in twitter</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>65</volume>
          (
          <year>2014</year>
          )
          <fpage>414</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Annett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kondrak</surname>
          </string-name>
          ,
          <article-title>A comparison of sentiment analysis techniques: Polarizing movie blogs</article-title>
          ,
          <source>in: Conference of the Canadian Society for Computational Studies of Intelligence</source>
          , Springer,
          <year>2008</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Sentiment analysis and subjectivity</article-title>
          .,
          <source>Handbook of Natural Language Processing</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>627</fpage>
          -
          <lpage>666</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>