=Paper= {{Paper |id=Vol-2540/paper38 |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-2540/FAIR2019_paper_5.pdf |volume=Vol-2540 }} ==None== https://ceur-ws.org/Vol-2540/FAIR2019_paper_5.pdf
      Corpus based Amharic sentiment lexicon generation
Girma Neshir (1), Andreas Rauber (2) and Solomon Atnafu (3)
(1) Addis Ababa University, IT Doctoral Program, Ethiopia, girma1978@gmail.com
(2) Technical University of Vienna, Institute of Information Systems Engineering, Austria, rauber@ifs.tuwien.ac.at
(3) Addis Ababa University, Department of Computer Science, Ethiopia, solomon.atnafu@aau.edu.et

Introduction: Sentiment lexicons are crucial for Amharic sentiment classification.
To date, two kinds of Amharic sentiment lexicons have been generated: a manually
built lexicon (1000) [2] and the dictionary based Amharic SWN and SOCAL lexicons [3].
However, dictionary based lexicons have the shortcoming that they have difficulty
capturing cultural connotations and language specific features. This research builds
a corpus based algorithm to handle language and culture specific words in the
lexicons [1]. It is probably impossible to cover all the words in the language,
because the corpus is a limited resource in almost all less resourced languages like
Amharic. It is still possible, however, to build sentiment lexicons for a particular
domain in which a large Amharic corpus is available. For this reason, a lexicon built
with this approach is usually used for lexicon based sentiment analysis in the same
domain from which it was built. The research questions addressed with this approach
are: (1) how can we build an approach to generate an Amharic sentiment lexicon from
a corpus? (2) how do we evaluate the validity and quality of the generated lexicon?
Related work: Our work is closely related to that of [4], which generated an emotion
lexicon by bootstrapping from a corpus using word distributional semantics (i.e.
Positive Point-wise Mutual Information (PPMI)). Our approach differs from [4] in that
we generate a sentiment lexicon rather than an emotion lexicon. The way sentiment is
propagated to expand the seeds is also different. In addition, the threshold
selection and the parts of speech of the seed words differ from language to language.
For example, unlike Italian, Amharic has few adverb classes [5]. Thus, our seed words
do not contain adverbs.
Proposed corpus based approaches: There is a variety of corpus based strategies,
including count based (e.g. PPMI) and predictive (e.g. word embedding) approaches.
In this part, we present the proposed count based approach for generating an Amharic
sentiment lexicon from a corpus. The framework has four components: (Amharic news)
corpus collection, a preprocessing module, a word-context PPMI matrix, and an
algorithm that produces the generated (Amharic) sentiment lexicon. See the framework
in Fig. 1 of the Appendix.
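The word-context PPMI component can be illustrated with a small sketch. It counts co-occurrences in a symmetric window and applies the usual definition PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))). The window size, the toy English tokens and the function names `cooccurrence_counts` and `ppmi_vectors` are assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count (word, context) pairs within a symmetric window around each token."""
    pairs = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, sent[j])] += 1
    return pairs

def ppmi_vectors(pairs):
    """Map each word to a sparse {context: PPMI} vector."""
    total = sum(pairs.values())
    word_totals, context_totals = Counter(), Counter()
    for (word, context), n in pairs.items():
        word_totals[word] += n
        context_totals[context] += n
    vectors = {}
    for (word, context), n in pairs.items():
        pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
        vec = vectors.setdefault(word, {})
        if pmi > 0:  # PPMI keeps only positive PMI values
            vec[context] = pmi
    return vectors

# Toy tokenized corpus (assumption: real input would be preprocessed Amharic news text).
sentences = [["good", "movie"], ["good", "film"], ["bad", "movie"]]
vectors = ppmi_vectors(cooccurrence_counts(sentences))
```

Sparse dictionaries are used here only to keep the sketch dependency-free; a large corpus would more likely call for a sparse matrix library.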
We developed algorithms for constructing Amharic sentiment lexicons automatically
from an Amharic news corpus. The corpus based approach relies on word co-occurrence
distributional embeddings, including frequency based embeddings (i.e. PPMI). First,
we build a word-context unigram frequency count matrix and transform it into a
point-wise mutual information matrix. For an experimentally chosen threshold value,
the words closest to the mean vector of the seed list are added to the lexicon. Then
the mean vector of the new sentiment seed list is updated, and the process is
repeated until the lexicon contains sufficient terms.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)

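The seed expansion loop can be sketched as follows, under stated assumptions. The paper does not name the similarity measure or say exactly how the threshold is applied, so cosine similarity, a fixed number of additions per round (`top_k`), a stopping size (`target_size`) and one run per polarity class are all assumptions, as are the function names and the hand-made toy vectors.

```python
import math

def mean_vector(words, vectors):
    """Average the sparse vectors of the given words."""
    total = {}
    for w in words:
        for ctx, val in vectors[w].items():
            total[ctx] = total.get(ctx, 0.0) + val
    return {ctx: val / len(words) for ctx, val in total.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(val * v.get(ctx, 0.0) for ctx, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_seeds(seeds, vectors, top_k, target_size):
    """Grow one polarity class by repeatedly adding the words nearest its mean vector."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    lexicon = [w for w in seeds if w in vectors]  # drop seeds unseen in the corpus
    if not lexicon:
        return []
    members = set(lexicon)
    while len(lexicon) < target_size:
        centroid = mean_vector(lexicon, vectors)  # recomputed after every round
        candidates = [w for w in vectors if w not in members]
        if not candidates:
            break
        ranked = sorted(candidates, key=lambda w: (-cosine(vectors[w], centroid), w))
        new = ranked[:min(top_k, target_size - len(lexicon))]
        lexicon.extend(new)
        members.update(new)
    return lexicon

# Hand-made toy vectors (assumption: real vectors would be rows of the PPMI matrix).
toy_vectors = {
    "good": {"c1": 1.0, "c2": 0.5},
    "great": {"c1": 0.9, "c2": 0.6},
    "nice": {"c1": 0.8},
    "bad": {"c3": 1.0},
    "awful": {"c3": 0.9, "c4": 0.4},
}
expanded = expand_seeds(["good"], toy_vectors, top_k=1, target_size=3)
```

Because this sketch stops only on size, it would eventually admit dissimilar words if `target_size` were set too high; a minimum similarity cutoff is one possible safeguard, though the paper does not mention one.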
Results: A seed list of 519 words is used to expand the PPMI based lexicons. With
experimentally obtained threshold values of 100 and 200, we obtained corpus based
Amharic sentiment lexicons of size 1811 and 3794, respectively. See the sample of the
generated lexicon in Table 2 of the Appendix. As discussed for dictionary based
lexicons in [3], stemming and negation handling considerably improve the performance
of lexicon based sentiment classification. In addition, a combination of lexicons
outperforms each individual lexicon.
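A minimal sketch of lexicon based classification with combined lexicons, stemming and negation handling is given here for illustration only. The averaging rule in `combine_lexicons`, the caller-supplied `stem` function, the `negators` set and the toy English entries are all hypothetical; in particular, treating negation as a separate token is a simplification that may not match how [3] handles Amharic negation.

```python
def combine_lexicons(*lexicons):
    """Merge {word: polarity} lexicons; a word's score is the mean over lexicons that contain it."""
    scores = {}
    for lex in lexicons:
        for word, polarity in lex.items():
            scores.setdefault(word, []).append(polarity)
    return {word: sum(vals) / len(vals) for word, vals in scores.items()}

def classify(tokens, lexicon, stem=lambda t: t, negators=frozenset()):
    """Sum lexicon polarities; a negator flips the sign of the next sentiment word."""
    score, flip = 0.0, False
    for token in tokens:
        if token in negators:
            flip = True
            continue
        polarity = lexicon.get(stem(token), 0.0)
        if polarity:
            score += -polarity if flip else polarity
            flip = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Toy lexicons (assumption: stand-ins for the manual and corpus based lexicons).
manual = {"good": 1.0, "bad": -1.0}
corpus_based = {"good": 1.0, "awful": -1.0}
combined = combine_lexicons(manual, corpus_based)

label_negated = classify(["not", "good"], combined, negators={"not"})
label_manual_only = classify(["awful", "film"], manual)
label_combined = classify(["awful", "film"], combined)
```

In this toy case the manual lexicon alone leaves the second comment unscored, while the combined lexicon assigns it a polarity, which is the kind of gain a lexicon combination can provide.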
Evaluation: We evaluated the generated Amharic sentiment lexicon in two ways:
external to the lexicon and internal to the lexicon. External evaluation tests the
usefulness and correctness of each lexicon for finding the sentiment score of a
sentiment labeled corpus of Amharic comments. Internal evaluation computes the degree
to which each generated lexicon overlaps (or agrees) with the manual, SOCAL and SWN
(Amharic) sentiment lexicons. For detecting the subjectivity of Amharic Facebook
comments, our lexicon shows an increase of 3.73 over the subjectivity detection rate
of the manual lexicon. For classifying the sentiment of Amharic Facebook comments,
our generated lexicon shows an increase of 6.71 over the manual sentiment lexicon.
See the evaluation of our lexicon in Table 1 of the Appendix. In addition, the
coverage results on a general corpus of 20 million tokens show that the coverage of
the PPMI based Amharic sentiment lexicon is better than that of the manual lexicon
and SOCAL, but lower than that of SWN. Unlike SWN, the PPMI based lexicon is
generated from a corpus, so its coverage in a general domain is limited. The results
also show that the positive and negative counts in almost all lexicons appear to
follow a balanced and uniform distribution of sentiment polarity terms in the corpus.
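The internal evaluation and the coverage measurement can be sketched as below. The paper does not give exact formulas, so the definitions used here (overlap as the number of shared entries, agreement as the share of shared entries with the same polarity sign, coverage as the fraction of corpus tokens found in the lexicon) are assumptions, as are the function names and toy data.

```python
def agreement(generated, reference):
    """Return (overlap size, share of overlapping words with the same polarity sign)."""
    shared = generated.keys() & reference.keys()
    if not shared:
        return 0, 0.0
    same = sum(1 for w in shared if (generated[w] > 0) == (reference[w] > 0))
    return len(shared), same / len(shared)

def coverage(lexicon, tokens):
    """Fraction of corpus tokens that appear in the lexicon."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in lexicon) / len(tokens)

# Toy data (assumption: stand-ins for a generated lexicon, a reference lexicon and a corpus).
generated = {"good": 1.0, "awful": -1.0, "fine": 1.0}
reference = {"good": 1.0, "awful": -1.0, "fine": -1.0, "bad": -1.0}
overlap, rate = agreement(generated, reference)
token_coverage = coverage(generated, ["good", "film", "awful", "plot"])
```

Here three entries are shared and two of them agree in sign, and two of the four corpus tokens are covered by the generated lexicon.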
Conclusions: This study showed that it is possible to create a sentiment lexicon for
low resourced languages from a corpus. Such a lexicon captures the language specific
features and connotations related to the culture in which the language is spoken,
which cannot be handled by a dictionary based approach that propagates labels from
resource rich languages. To the best of our knowledge, this is the first time a PPMI
based approach has been used to generate an Amharic sentiment lexicon from a corpus,
and it was done with minimal cost and time. The generated lexicons can be used in
combination with other sentiment lexicons to enhance the performance of sentiment
classification in Amharic. The approach is generic and can be adapted to other
resource limited languages to reduce the cost of human annotation and the time it
takes to annotate sentiment lexicons. Although the PPMI based Amharic sentiment
lexicon outperforms the manual lexicon, a prediction (word embedding) based approach
is recommended for generating Amharic sentiment lexicons that handle context
sensitive terms.


References
1. D Alessia, Fernando Ferri, Patrizia Grifoni, and Tiziana Guzzo. Approaches, tools and
   applications for sentiment analysis implementation. International Journal of Computer
   Applications, 125(3), 2015.
2. S. Gebremeskel. Sentiment mining model for opinionated Amharic texts. Unpublished
   Master's thesis, Department of Computer Science, Addis Ababa University, Addis Ababa,
   2010.
3. Girma Neshir Alemneh, Andreas Rauber, and Solomon Atnafu. Dictionary Based Amharic
   Sentiment Lexicon Generation, pages 311-326. 08 2019.
4. Lucia Passaro, Laura Pollacci, and Alessandro Lenci. Item: A vector space model to
   bootstrap an Italian emotive lexicon. In Second Italian Conference on Computational
   Linguistics CLiC-it 2015, pages 215-220. Academia University Press, 2015.
5. Baye Yimam. (የአማርኛ-ሰዋሰዉ) yäamarIña säwasäw. Educational Materials Production
   and Distribution Enterprise (EMPDE), 2000 E.C.


    Appendix: List of figures and tables




Fig. 1 Proposed framework

Table 1. Evaluation of Corpus based Generated Amharic lexicon for Amharic Facebook Sentiment Classification




Table 2. Sample of Corpus based Generated Amharic lexicon