Annotated Lexicon for Sentiment Analysis in Bosnian
Language
Sead Jahić1 , Jernej Vičič1
1 Faculty of Mathematics, Natural Science and Information Technologies, University of Primorska, Koper, Slovenia



             Abstract
             The paper presents the first sentiment-annotated lexicon of the Bosnian language. The language coverage of
             the lexicon was evaluated using two reference corpora; the usability of the lexicon had already been proven on a
             Twitter-based annotation task. Two approaches were applied in this experiment: the first method used a frequency
             list of all lemmas extracted from two relevant Bosnian-language corpora, and the second method counted lemma
             occurrences without using frequency as a factor. The results of the study suggest usable language coverage. The
             computed coverage with the first method was 27.25% for the first corpus and 24.34% for the second corpus. The
             second method yielded 1.899% coverage for the first corpus and 6.05% for the second corpus.

             Keywords
             Bosnian lexicon, corpus, sentiment analysis, AnAwords, stopwords




1. Introduction
Sentiment analysis (or opinion mining) is a technique used to determine whether a piece of data, most often
text, is positive, negative, or neutral. The growing interest in the efficient analysis of web texts has led to
remarkable developments in the field of sentiment analysis. Sentiment analysis combines natural language
processing (NLP) and machine learning techniques to assign weighted sentiment scores to the entities
within a sentence or phrase. Social networks enable users to express their thoughts and feelings more
openly than ever before, so sentiment analysis is becoming an essential tool for monitoring and understanding
that sentiment.
   In this paper, we present the coverage of the first Bosnian sentiment-annotated lexicon measured on two
reference corpora. The results of the study suggest usable language coverage. We applied two different
approaches; the first yielded 27.25% coverage and the second around 1.9%. The main reason for the
difference lies in the fact that the first approach used lemmas together with their frequencies, while the
second approach neglected the frequencies of the given lemmas.
   Section 2 presents the state of the art: an overview of what has been done in NLP for the Bosnian language,
in sentiment analysis, and in lexicon and corpus construction. The methodology, i.e. the process of cleaning
the corpora, covering the corpora with the lexicon, and the use of the stop-word list and intensifiers, is
described in Section 3, together with the stages of the annotation process. Section 4, which extends
Section 3, presents all results of the experiment. The last section is reserved for the conclusion and
further work.


2. State of the Art
There has been quite extensive research in the area of sentiment analysis, and many types of models and
algorithms have been proposed depending on the final goal of analysing and interpreting users' feedback
and queries, such as fine-grained sentiment analysis (based on polarity precision), emotion detection,
aspect-based sentiment analysis, and multilingual sentiment analysis. All these algorithms and models
can be divided into one of three basic classes: rule-based systems (relying on long-used linguistic
methods, rules, and annotated linguistic materials such as annotated lexicons), automatic (corpus-based)
systems, and hybrid systems that combine properties of both previous types. Hybrid systems use machine
learning techniques together with NLP techniques developed in computational linguistics, such as
stemming, tokenization, part-of-speech tagging, parsing, and lexicons.


   Lexicons have been widely used for sentiment analysis. One of the first-known, human-annotated
lexicons for sentiment analysis is the General Inquirer lexicon (Stone et al. [1]), which contains 11,788
English words (2,291 labelled as negative and 1,915 as positive, with the rest labelled as objective). Sentiment
lexicons exist for most Slavic languages; examples are lexicons for Bulgarian (Kapukaranov and Nakov
et al. [2]), Croatian (Glavaš et al. [? ]), Czech (Veselovská [3]), Macedonian (Jovanoski et al. [4]), Polish
(Wawer [5]), Slovak (Okruhlica [6]), Slovenian (Kadunc [7]) and Bosnian (Jahić, Vičič [8]).
   The usability of the Bosnian lexicon on a sentiment-tagging task was already proven in a Twitter
annotation task (Jahić, Vičič [8]). It is loosely based on the Slovenian lexicon (Kadunc [7]), which consists
of words and lemmas. The lexicon creation process consisted of taking words from the Slovenian lexicon
and translating them into Bosnian; the sentiment of each Bosnian translation was manually checked during
the translation process. This lexicon was used for measuring coverage and it contains 1279 entries labelled
as positive and 3116 as negative.
   An important question for natural-language researchers, general linguists, and even teachers and students
is how much text coverage can be achieved with a certain number of lemmas from a lexicon in a given
language, since the number of terms in a lexicon is a few orders of magnitude smaller than the number of
terms in a corpus. Studies of vocabulary coverage have been carried out for many languages. For German,
Jones [9] showed, in a study based on the BYU/Leipzig Corpus of Contemporary German, that a basic
vocabulary of 3,000 high-frequency words can account for between 75% and 90% of the words in a text; for
Spanish, Davies [10] stated that it is enough for a language learner to know the basic 4,000 words in order
to cover/recognize more than 90% of the words in a native text. Moreno-Ortiz and Pérez-Hernández [11]
presented Lingmotif-lex, a wide-coverage, domain-neutral lexicon for sentiment analysis in English, and
stated that it achieves significantly better performance than other lexicons for English, with coverage
going up to 75% and 84% (F1-score) on two data sets.
   Bučar et al. [12] introduced new language resources (corpora, annotations and a lexicon) for sentiment
analysis in Slovene. They retrieved more than 250,000 news items from five Slovene web media sources.
Five different measures of correlation were used to evaluate the annotation process. In general, all the
measures indicate good internal consistency at all levels of granularity; however, their values decrease
steadily when applied at the paragraph and sentence levels.
   To the authors' knowledge, there have been almost no attempts at NLP for the Bosnian language, so this
paper presents one of the first steps in NLP for Bosnian and puts the language side by side with other
world languages. Although the authors are aware that the method is not perfect, the reference corpus is
used as a norm representing the living language to the best possible extent.
   Corpus-based and lexicon-based methods have been increasingly used to compare language usage.
Comparing the hundreds of thousands or millions of words/lemmas in a corpus with a few thousand
words/lemmas in a lexicon is one of the main types of corpus comparison. In this case, we refer to the
corpus as 'normative', since it provides a standard against which we can compare. As stated in [13], it is
possible to compare three or more corpora at the same time, but this only makes the results more difficult
to interpret.


3. Methodology and work
The Bosnian sense-annotated lexicon is presented and analyzed in this paper. Since the lexicon also contains
words that were not classified as positive or negative, the notation 'sense lexicon' was used to describe its
purpose. More precisely, 'sense' indicates that the lexicon consists of the "core" lexicon (positive and
negative words), a list of stop-words, and a list of AnAwords (Affirmative and Non-affirmative words),
as illustrated in Fig. 1.
   Two corpora were used to test language coverage:
    • the Bosnian web corpus bsWaC 1.1 [14], a collection of web pages crawled in 2016. The corpus
      consists of texts in three languages (Bosnian, Croatian and Serbian), and each text is tagged with
      one of the languages. The corpus is morphosyntactically annotated and lemmatized and consists
      of more than 285 million words. At the time of writing, this corpus was the de facto reference
      corpus for the Bosnian language;
    • the Bosnian news corpus 2021, bsNews 1.0 [15], a collection of web news articles crawled at the
      start of 2021. The corpus contains a balanced set of at most 2000 of the most recent news articles
      from each identified web news portal in Bosnia and Herzegovina; the list of portals is maintained
      by the Press Council in Bosnia and Herzegovina. The corpus contains news articles from 46 portals
      and was used as a contemporary and balanced source. The sentence tokens are morphosyntactically
      annotated with the MULTEXT-East morphosyntactic annotations for Croatian, Version 6
      (http://nl.ijs.si/ME/V6/); the corpus was annotated and lemmatized with ToTaLe [16]. It consists
      of more than 36 million words.

Figure 1: Construction of the Lexicon

Two different approaches are applied:
    • First, all lemmas with their frequencies were considered,
    • Second, the frequencies for lemmas were ignored.
A list of lemmas with frequencies was extracted from each corpus and cut off at 5 occurrences to avoid
clutter. The list of lemmas extracted from the first corpus (bsWaC, Ljubešić and Klubička [14]) consisted of
371385 different lemmas with their frequencies. The lemmas are ordered by increasing frequency, with the
lowest value of 5 (the cutoff, e.g. "batkovi - drumsticks") and the highest value of 16652046 for the lemma
"biti - to be". The list of lemmas extracted from the second corpus (bsNews 1.0 [15]) consisted of 101773
lemmas ordered by decreasing frequency; the most frequent lemma is again "biti - to be" with a frequency
of 2350487, and several lemmas, such as "polegnuti - lay down", have the lowest frequency of 5.
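   A minimal sketch of how such a cut-off frequency list could be built is shown below; it assumes the
lemmatized tokens of a corpus are already available as a list of strings, and the names (build_frequency_list,
corpus_lemmas) are illustrative rather than the authors' actual code.

    from collections import Counter

    def build_frequency_list(lemmas, cutoff=5):
        # Count lemma occurrences and drop lemmas seen fewer than `cutoff` times.
        counts = Counter(lemmas)
        return {lemma: freq for lemma, freq in counts.items() if freq >= cutoff}

    # Toy usage: corpus_lemmas would hold the lemmatized tokens of a corpus.
    corpus_lemmas = ["biti", "biti", "biti", "biti", "biti", "kuca", "polegnuti"]
    print(build_frequency_list(corpus_lemmas))  # {'biti': 5}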
   Not all lemmas can be included in the analysis. Symbols, quotation marks, and numbers, even if they are
part of the corpus, cannot be part of the lexicon, especially a sentiment-annotated lexicon. These items were
removed from both corpora during cleaning: special characters that appear in Bosnian (unusual for languages
outside the southwestern Slavic group); emoticons; punctuation; numbers; and hyperlinks.
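   The exact cleaning procedure is not given in the paper; the following minimal sketch illustrates one
possible implementation of the removal of hyperlinks, numbers, punctuation, emoticons and other symbols
(the patterns and names are assumptions).

    import re

    def keep_lemma(lemma):
        # Reject hyperlinks.
        if re.match(r"https?://|www\.", lemma):
            return False
        # Reject entries made only of digits, punctuation or other symbols
        # (this also drops most emoticons such as ":)").
        if re.fullmatch(r"[\W\d_]+", lemma):
            return False
        return True

    raw = ["kuca", "http://example.ba", "123", ":)", "sretan"]
    print([l for l in raw if keep_lemma(l)])  # ['kuca', 'sretan']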
    • The first approach was to include lemmas with their frequencies in the analysis (every occurrence
      of a lemma was counted for each corpus).




Figure 2: Process of matching lemmas from corpus with words from Lexicon


   Figure 2 shows the procedure for checking whether a given word from the lexicon exists in the corpus.
If the word exists in the corpus, its corpus frequency is added to freq(lexicon); otherwise, the value 0 is
added to freq(lexicon).
   The sum of all word frequencies in the corpus is denoted freq(corpus), and freq(stopwords) denotes the
sum of the frequencies of the stop-words that appear in the corpus. The coverage is computed as:

                     coverage = freq(lexicon) / (freq(corpus) - freq(stopwords)),                (1)

where the stop-word frequencies are excluded from the corpus total.
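   A minimal sketch of the first-approach computation in equation (1) is shown below; the frequency
dictionary and the word lists are assumed to be given, and all names are illustrative.

    def frequency_coverage(corpus_freqs, lexicon_words, stop_words):
        # coverage = freq(lexicon) / (freq(corpus) - freq(stopwords)), equation (1)
        freq_corpus = sum(corpus_freqs.values())
        freq_lexicon = sum(corpus_freqs.get(w, 0) for w in lexicon_words)
        freq_stopwords = sum(corpus_freqs.get(w, 0) for w in stop_words)
        return freq_lexicon / (freq_corpus - freq_stopwords)

    # Toy usage: 10 / (160 - 150) = 1.0
    corpus_freqs = {"biti": 100, "i": 50, "sretan": 10}
    print(frequency_coverage(corpus_freqs, {"sretan"}, {"biti", "i"}))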
    • The second approach was based on the presence of lemmas, without the influence of frequency.
After pre-processing, 286095 lemmas from the first and 92268 lemmas from the second corpus were included
in further analysis (see Table 1).

   Table 1
   Number of lemmas left after pre-processing the corpora

                                                           Corpus1    Corpus2
                               Overall number of lemmas     371385     101773
                                   Cleared lemmas           285963      92183
                                       Percent (%)          76.999      90.58


   In order to be able to compare words from the lexicon and the corpus, the letters typical of the South
Slavic languages, such as "č, ć, đ, ž", were replaced with "c, c, dj, z".
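   A minimal sketch of this normalization is given below; the mapping follows the replacements stated
above, and handling of upper-case letters is omitted.

    # Mapping stated in the text: č, ć, đ, ž -> c, c, dj, z.
    TRANS = str.maketrans({"č": "c", "ć": "c", "đ": "dj", "ž": "z"})

    def normalize(word):
        # Strip Bosnian diacritics so lexicon and corpus entries can be compared directly.
        return word.translate(TRANS)

    print(normalize("anđeo"), normalize("čovječe"))  # andjeo covjece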
   Given that sentiment value is not at the forefront at this stage of the research (we are looking at language
coverage), the 1279 positive and 3116 negative words were united into a single lexicon.
   In addition to the lexicon, two other groups of words play a significant role in this process: stop-words
(343 in our collection) and AnAwords (Affirmative and Non-affirmative words). Jahić and Vičič [8] pointed
out that stop-words usually refer to the most common words in a language and that there is no single
universal list of stop-words.
   The AnAwords list of 102 words was created by Jahić and Vičič [8], and Osmankadić [17] has shown that
most of those words are intensifiers. The influence of the words from the AnAwords list was also considered
in the coverage of the corpus.
   The annotation process went through several stages, all based on the ratio:

                                      FOUND / NOT_FOUND                                          (2)

where FOUND denotes the number of corpus lemmas that were matched with words from the lexicon and
NOT_FOUND the number of lemmas that were not (a small code sketch of this counting is given after the
list of stages).
Those stages are:
    • Simple coverage of the corpus by the lexicon, as in the first stage; the stop-words were part of the
      corpus at this stage.
    • While in the 1st stage the stop-words were an integral part of the corpus, in the 2nd stage the
      corpus was covered without them. Since the number of stop-words is almost negligible in relation
      to the number of elements in a corpus, a large difference in coverage was not expected at this stage.
    • Guided by the results of the corpus-based lexical analysis of subject-specific university textbooks
      in English by Hajiyeva [18], in the 3rd stage coverage was observed through the frequency
      distribution of lemmas.
    • In the 4th stage, the question arises whether it is possible to group similar words (such as "andjeo"
      and "andjel" (angel)) and view them as a single word. As a solution to this problem, Davies [10]
      suggested grouping words according to word families. Given this possibility of grouping, matching
      functions were applied between corpus words and lexicon words.
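   To make the second-approach counting concrete, the following minimal sketch computes FOUND,
NOT_FOUND and the ratio from equation (2); removing stop-words and AnAwords first corresponds to the
2nd stage. Names and data are illustrative, not the authors' script.

    def lemma_coverage(corpus_lemmas, lexicon_words, excluded=()):
        # Second approach: distinct corpus lemmas with/without a lexicon match, equation (2).
        lemmas = set(corpus_lemmas) - set(excluded)   # 2nd stage: drop stop-words and AnAwords
        found = len(lemmas & set(lexicon_words))
        not_found = len(lemmas - set(lexicon_words))
        return found, not_found, found / not_found

    # Toy usage: FOUND = 2, NOT_FOUND = 1, ratio = 2.0
    print(lemma_coverage(["sretan", "i", "kuca", "tuga"], {"sretan", "tuga"}, excluded={"i"}))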
4. Results
This section presents the results achieved by the two approaches described in Section 3.
   In the first approach (coverage of the corpora using lemma occurrences weighted by frequency), the
total corpus frequency freq(corpus), the sum of the stop-word frequencies freq(stopwords), and the overall
sum of the frequencies of the words from the lexicon, freq(lexicon), were computed.
   Using equation (1), the coverage of corpus1 is 27.25% and the coverage of corpus2 is 24.34% (see
Table 2).

   Table 2
   Coverage of the corpora's lemmas by words from the sentiment lexicon

                             freq(corpus)   freq(lexicon)   freq(stopwords)   COVERAGE
                 CORPUS1      197245460       43555699         37414905        27.25%
                 CORPUS2       30599375        6238242          4971734        24.34%


   The second approach (coverage of the corpora based on the presence of lemmas, without the influence
of frequency) was to compute the overall coverage of the corpora without using word frequencies. The
motivation behind this approach was to count how many different lemmas from the corpus are already
present in the sentiment lexicon. This approach comprised several stages.

First stage: In the first stage, 1.199% coverage was achieved for the 1st corpus and 3.21% for the 2nd corpus.

   Table 3
   Coverage of the corpora's lemmas by words from the sentiment lexicon (without additional changes)

                                                            Corpus1     Corpus2
                                             FOUND           3389         2866
                                           NOT_FOUND        282574        89317
                                           Coverage (%)      1.199         3.21


             Table 3 presents the number of lemmas that were matched with words from the lexicon (FOUND)
             and the number of lemmas that were absent from the lexicon (NOT_FOUND).
             Maximum coverage of a corpus is possible only if all words from the lexicon appear in the corpus;
             this means that the maximum coverage is 1.54% for the first corpus and 4.76% for the second.
             On the other side, the coverage of the lexicon by the corpora is 77.11% and 65.21%: of the 4395
             words in the lexicon, 3389 were present in corpus1 (77.11% of the lexicon was used) and 2866
             were present in corpus2 (65.21% of the lexicon was used).
Second stage: The second stage basically decreases the number of lemmas in the corpora, since all lemmas that
             are stop-words or AnAwords were removed. In this case, the coverage of the corpora increased to
             1.2% and 3.22%.

   Table 4
   Coverage of the corpora's lemmas by words from the sentiment lexicon (stop-words and AnAwords removed)

                                                            Corpus1     Corpus2
                                             FOUND           3389         2866
                                           NOT_FOUND        282330        88994
                                           Coverage (%)       1.2          3.22


Third stage: The third stage covered the corpora by distributing the lemmas by frequency and counting the
             number of lemmas that were or were not covered by words from the lexicon (a small sketch of this
             counting is given below).
             Among the 50000 most frequent lemmas of corpus1, there were about 1962 words from the lexicon;
             of the 3389 lexicon words matched in corpus1 overall, 57.89% were thus included in the 50000
             most frequent lemmas of corpus1 (see Figure 3, left).
             Moreover, of the 15000 most frequent lemmas of corpus2, 1616 were present in the lexicon. Bearing
             in mind that the overall number of matched lexicon words for corpus2 is 2866, 56.39% of all
             matched lexicon words are present in the 15000 most frequent lemmas of corpus2 (see Figure 3,
             right).

Figure 3: Annotated lexicon by distributed lemmas from corpus1 (left) and corpus2 (right)
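   A minimal sketch of the third-stage counting (illustrative names): lemmas are ranked by frequency and,
for a chosen N, the number of lexicon words among the N most frequent lemmas is reported.

    def lexicon_hits_in_top_n(corpus_freqs, lexicon_words, n):
        # Count how many of the n most frequent lemmas are present in the lexicon.
        ranked = sorted(corpus_freqs, key=corpus_freqs.get, reverse=True)
        return len(set(ranked[:n]) & set(lexicon_words))

    # Toy usage: among the 3 most frequent lemmas, only "sretan" is in the lexicon.
    corpus_freqs = {"biti": 1000, "sretan": 40, "kuca": 30, "tuga": 6}
    print(lexicon_hits_in_top_n(corpus_freqs, {"sretan", "tuga"}, n=3))  # 1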
Fourth stage: In the 4th stage, the lexicon annotation increased to more than 1.899% for the 1st corpus.
             Although this looks like a contradiction with the statement that the maximum coverage for the first
             corpus is about 1.54%, it is not. The reason is that the get_close_matches function (imported from
             difflib in Python) was applied with a cutoff of 80% and n = 1 (one possibility).

                                    get_close_matches(word, possibilities[, n][, cutoff])

             The function works in such a way that all words that are sufficiently similar (at least 80% matching
             in this case) are considered as one word. For example, andjel (engl. angel), andjelko (engl. little
             angel) and andjela ("I saw an angel") were all replaced with andjeo. The number of lemmas for
             corpus1 decreased to 229210 (see Table 5) and the annotation increased to 1.899%, which means
             that of the 229210 lemmas from corpus1, 4271 were found in the lexicon. The same happens for
             corpus2, where 4101 words from the lexicon were detected (an illustrative sketch of this grouping
             is given below, before Table 5).
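   The exact grouping procedure is not given; the following minimal sketch shows one possible reading of
the step, mapping each corpus lemma to its closest lexicon entry with difflib.get_close_matches using
cutoff=0.8 and n=1 (function and variable names are illustrative).

    from difflib import get_close_matches

    def group_by_lexicon(corpus_lemmas, lexicon_words, cutoff=0.8):
        # Map each corpus lemma to its closest lexicon entry (similarity >= cutoff), if any.
        lexicon = list(lexicon_words)
        groups = {}
        for lemma in set(corpus_lemmas):
            close = get_close_matches(lemma, lexicon, n=1, cutoff=cutoff)
            if close:
                groups.setdefault(close[0], set()).add(lemma)
        return groups

    # Toy usage: variants similar enough to "andjeo" (e.g. "andjel", "andjelko")
    # are collapsed onto that single lexicon entry; "kuca" finds no close match.
    print(group_by_lexicon(["andjel", "andjela", "andjelko", "kuca"], ["andjeo"]))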

   Table 5
   Coverage of the corpora's lemmas by words from the sentiment lexicon (after grouping similar words)

                                                             Corpus1    Corpus2
                                            No. of lemmas     229210      71857
                                              FOUND            4271        4101
                                            NOT_FOUND         224939      67756
                                            Coverage (%)       1.899       6.05



        Although the 3rd stage gives an insight into the annotation of the most frequent lemmas, the most
      important stages for the overall annotation were the first, second, and fourth, since they produce the
      overall coverage of the corpora by the lexicon (see Table 6).

          Table 6
          Annotation of corpora
5. Conclusion
The sentiment annotation of a lexicon and the corpus-lexicon based methods presented here constitute first
results in the field of sentiment analysis for the Bosnian language. Although the Bosnian language is
arguably closely related to Serbian and Croatian, there are subtle differences among these three languages
that become more evident from the sentiment-analysis point of view.
This paper presents the annotation coverage of the first Bosnian sentiment lexicon, whose usability had
earlier been proven on a sentiment-tagging task. The lexicon comprises approximately 4400 words and
covers more than 27% of the lemmas in the first observed corpus (corpus1, Ljubešić and Klubička [14])
and more than 24% of the lemmas in the second observed corpus (corpus2, bsNews 1.0 [15]). If the
emphasis is on the coverage of distinct lemmas from the corpus by the lexicon, then the coverage is 1.2%
for corpus1 and 3.2% for corpus2. This coverage increases when matching functions are applied between
the corpora's lemmas and the lexicon's words (as described in the fourth stage of the second approach); in
that case, the coverage rises to 1.9% for corpus1 and 6% for corpus2.
This means that almost 97% of the lexicon was used to annotate corpus1 (132 words from the lexicon were
not found in corpus1) and more than 93% to annotate corpus2 (only 308 words from the lexicon were not
present in corpus2).
The results show that about a quarter of the lemma occurrences in the corpora have their sentiment value
annotated in the lexicon, which greatly helps in the sentiment annotation of sentences (tweets or regular
text). Stop-words and AnAwords were also included in the analysis, which opens the possibility that the
LSAnA group becomes a representative group for emotional words, stop-words, and intensifiers (all written
in Bosnian). The language coverage of the lexicon is comparable with the state of the art; the values can
be compared with those in [11].
The focus of our future work will be on developing and improving the LSAnA group. All members of the
group should be extended, which means that we expect to have more items/words labelled as positive or
negative in our 'core' lexicon, as well as extended lists of stop-words and AnAwords. To increase coverage,
we will try to create a lexicon with all possible lemma forms, taking into account the grammatical rules of
the Bosnian language itself (declension, conjugation, inflection of words by gender, number, and so on).
Although the process of annotation, as well as the improvement of the first Bosnian lexicon, is still under
development, the results shown are comparable with the results reported for other languages [10, 12].


Acknowledgments
The authors gratefully acknowledge the European Commission for funding the InnoRenew CoE project
(Grant Agreement #739574) under the Horizon2020 Widespread-Teaming program and the Republic of
Slovenia (Investment funding of the Republic of Slovenia and the European Union of the European Regional
Development Fund).


References
 [1] P. Stone, D. Dunphy, M. Smith, D. Ogilvie, The General Inquirer: A Computer Approach to Content
     Analysis, volume 4, 1966. doi:10.2307/1161774.
 [2] B. Kapukaranov, P. Nakov, Fine-grained sentiment analysis for movie reviews in Bulgarian, in:
     Proceedings of the International Conference Recent Advances in Natural Language Processing,
     INCOMA Ltd. Shoumen, BULGARIA, Hissar, Bulgaria, 2015, pp. 266–274. URL: https://aclanthology.
     org/R15-1036.
 [3] K. Veselovská, Czech subjectivity lexicon : A lexical resource for czech polarity classification, in: In
     Proceedings of the 7th international conference Slovko, Bratislava, 2013, pp. 279–284.
 [4] D. Jovanoski, V. Pachovski, P. Nakov, Sentiment analysis in Twitter for Macedonian, in: Proceedings
     of the International Conference Recent Advances in Natural Language Processing, INCOMA Ltd.
     Shoumen, BULGARIA, Hissar, Bulgaria, 2015, pp. 249–257. URL: https://aclanthology.org/R15-1034.
 [5] A. Wawer, Extracting emotive patterns for languages with rich morphology, International Journal of
     Computational Linguistics and Applications 3 (2012) 11–24.
 [6] A. Okruhlica, Slovak sentiment lexicon induction in absence of labeled data, Master’s thesis, Comenius
     University Bratislava, 2013.
 [7] K. Kadunc, Določanje sentimenta slovenskim spletnim komentarjem s pomočjo strojnega učenja
     (2016). URL: https://repozitorij.uni-lj.si/IzpisGradiva.php?lang=eng&id=91182.
 [8] S. Jahić, J. Vičič, Determining Sentiment of Tweets Using First Bosnian Lexicon and (AnA)-Affirmative
     and Non-affirmative Words, Springer International Publishing, Cham, 2021, pp. 361–373. URL: https:
     //doi.org/10.1007/978-3-030-54765-3_25. doi:10.1007/978-3-030-54765-3_25.
 [9] R. L. Jones, An analysis of lexical text coverage in contemporary German, Brill, Leiden,
     The Netherlands, 2006, pp. 115 – 120. URL: https://brill.com/view/book/edcoll/9789401202213/
     B9789401202213-s010.xml. doi:https://doi.org/10.1163/9789401202213_010.
[10] M. Davies, Vocabulary range and text coverage. insights from the forthcoming routledge frequency
     dictionary of spanish, in: Selected Proceedings of the 7th Hispanic Linguistics Symposium, 2005, pp.
     106–115.
[11] A. Moreno-Ortiz, C. Pérez-Hernández, Lingmotif-lex: a wide-coverage, state-of-the-art lexicon for
     sentiment analysis, in: Proceedings of the Eleventh International Conference on Language Resources
     and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan,
     2018. URL: https://aclanthology.org/L18-1420.
[12] J. Bučar, M. Žnidaršič, J. Povh, Annotated news corpora and a lexicon for sentiment analysis in slovene,
     Language Resources and Evaluation 52 (2018) 895–919. doi:10.1007/s10579-018-9413-3.
[13] P. Rayson, R. Garside, Comparing corpora using frequency profiling, in: Proceedings of the Workshop
     on Comparing Corpora - Volume 9, WCC ’00, Association for Computational Linguistics, USA, 2000,
     p. 1–6. URL: https://doi.org/10.3115/1117729.1117730. doi:10.3115/1117729.1117730.
[14] N. Ljubešić, F. Klubička, bs,hr,srWaC - web corpora of Bosnian, Croatian and Serbian, in: Proceedings
     of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg,
     Sweden, 2014, pp. 29–35. URL: https://aclanthology.org/W14-0405. doi:10.3115/v1/W14-0405.
[15] J. Vičič, Bosnian news corpus 2021, 2021. URL: http://hdl.handle.net/11356/1406, slovenian language
     resource repository CLARIN.SI.
[16] T. Erjavec, C. Ignat, B. Pouliquen, R. Steinberger, Massive multi lingual corpus compilation: Acquis
     communautaire and totale, Archives of Control Sciences 15 (2005).
[17] M. Osmankadić, A Contribution to the Classification of Intensifiers in English and Bosnian, Institut
     za jezik, 2003.
[18] K. Hajiyeva, A corpus-based lexical analysis of subject-specific university textbooks for english
     majors, Ampersand 2 (2015) 136–144. doi:10.1016/j.amper.2015.10.001.