=Paper= {{Paper |id=Vol-2606/2paper |storemode=property |title=What's in the News? Identification of Trending Topics in Alternative and Mainstream Lithuanian Media |pdfUrl=https://ceur-ws.org/Vol-2606/2paper.pdf |volume=Vol-2606 |authors=Justina Mandravickaitė,Monika Briedienė,Jonas Uus,Tomas Krilavičius |dblpUrl=https://dblp.org/rec/conf/twsdetection/MandravickaiteB20 }} ==What's in the News? Identification of Trending Topics in Alternative and Mainstream Lithuanian Media== https://ceur-ws.org/Vol-2606/2paper.pdf
What’s in the News? Identification of Trending Topics in
   Alternative and Mainstream Lithuanian Media

Justina Mandravickaitė1, Monika Briedienė1,2[0000-0001-6165-1702], Jonas Uus1 and Tomas
                          Krilavičius1,2[0000-0001-8509-420X]
        1 Baltic Institute of Advanced Technology, Pilies str. 16, Vilnius 01124, Lithuania
         2 Vytautas Magnus University, K. Donelaičio str. 58, Kaunas 44248, Lithuania

                                          justina@bpti.lt



         Abstract. It is no longer surprising that internet media is a significant appliance
         in reflecting and shaping public opinion. Tracking topics dynamics and focus in
         different media channels is an important tool for opinion-forming mechanisms
         and process analysis. Information collect, text analytics and Artificial Intelli-
         gence tools allows identification of trending topics in different media sources,
         while exploratory visual analytics tools provide means to identify prevalence of
         topics in different sources, and their dynamics. In this paper we discuss an ongo-
         ing research and demonstrate applicability of such approach to main Lithuanian
         news portal (delfi.lt) and alternative unconventional media channels – sarmatas.lt
         and netiesa.lt.

         Keywords: Topic modelling, Framing, Media Monitoring, NLP, Lithuanian
         language, Artificial Intelligence, LDA, stm.


1        Introduction

Internet media is an important tool in reflecting and shaping public opinion. Modern
tools and technologies allow automatic tracking and comparing dynamics of different
topics in different media channels, and analysis of the results using visual tools. We
apply a set of such tools for the two types of Lithuanian news portals: main WWW
news channel - delfi.lt1 and two alternative unconventional media channels - sarma-
tas.lt2 and netiesa.lt3. We apply topic modelling methods for (trending) topics identifi-
cation, and visual results for the further analysis.
   Topic modelling is a text mining technique to discover common topics in a collection
of documents. In practice researchers attempt to fit appropriate model parameters to the
data corpus using one of several heuristics for maximum likelihood fit.



1   https://www.delfi.lt/, last accessed 2020/03/15
2   http://www.sarmatas.lt/, last accessed 2020/03/15
3   http://netiesa.lt/, last accessed 2020/03/15

  Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
mons License Attribution 4.0 International (CC BY 4.0).
2


    Media Framing Dynamics of the ‘European Refugee Crisis’ is analyzed in [1]. This
study investigates the national media discourses in Hungary, Germany, Sweden, the
United Kingdom and Spain for this time period. LDA was applied 130,042 articles in
5 languages from 24 news outlets. It shows that country-specific media tracks the over-
all course of the refugee debate, uncovers dynamics and shifts in discourses.
    Turkish news analysis is presented in [2]. The dataset consists of 4200 Turkish news
titles belonging to 7 classes. NMF was the most successful method for three classes,
while for five and seven classes LSA was the most successful method. Comparative
study is presented in [3] as well.
    There is an interesting study of topic modelling of news articles for two consecutive
elections in South Africa [4]. Articles are classified using pairwise cosine similarity to
identify similar topics in different periods of elections.
    Critical evaluation of the utility of the thematic grouping of texts into ‘topics’ emerg-
ing from a large collection of online patient comments about the National Health Ser-
vice (NHS) in England is presented in [5]. Results show that topic modelling allowed
to group texts into topics that were truly thematically coherent with a mixed degree of
success, while the more traditional approaches to discourse analysis consistently pro-
vided a more nuanced perspective on the data which was ultimately closer to the reality
of the texts it contains.
    In [6] paper, authors describe their work in developing a model for topic modelling
and detection of hot topics being discussed in the local Malay news publisher. This
model explored different features for article clustering and topic modelling, and then
applied the TextRank algorithm to identify hot topics in the news.
    The tremendous growth of social media content on the Internet has inspired the de-
velopment of the text analytics to understand and solve real-life problems. Leveraging
statistical topic modelling helps researchers in better comprehension of textual content
as well as provides useful information for further analysis.
    Authors [7] have tested Dengue epidemics tracking using Twitter content classifica-
tion and topic modelling. Classifier achieves a prediction accuracy of about 80 % based
on a small training set of about 1,000 instances, but the need for manual annotation
makes it hard to track seasonal changes in the nature of the epidemics, such as the
emergence of new types of virus in certain geographical locations. In contrast, LDA-
based topic modelling scales well, generating cohesive and well-separated clusters from
larger samples.
    Another experiment with Twitter data set on topic modelling was for identification
of vaccine reactions. The study [8] compared Gensim LDA, MALLET, and
jLDADMM DMM models to determine the most effective model for detecting vaccine
safety signals, assisted by an evaluation process that used an adjusted F-Scoring tech-
nique over a labelled subset of the documents.
    Paper [9] uses 18,552 tweets dated from 2015 up to 2018 to analyze the dynamics of
the LGBT conversation among Indonesian peoples. In this research, they explore the
main topic of the LGBT conversation using LDA. The result shows that there are seven
main categories that people normally talked about regarding LGBT.
                                                                                         3


   Study [10] summarizes the message content of four data sets of Twitter messages
relating to challenging social events in Kenya. They use LDA topic modelling to ana-
lyze the content. This study uses two evaluation measures: Normalized Mutual Infor-
mation (NMI) and topic coherence analysis, to select the best LDA models. The ob-
tained LDA results show that the tool can be effectively used to extract discussion top-
ics and summarize them for further manual analysis.
   Investigations can be done with short texts as well. [11] conduct a topic modelling
of 6854 Instagram posts made by Ramzan Kadyrov (the head of the autonomous Che-
chen Republic in the Russian Federation). Researchers analyze the verbal framing of
24 dominant topics. The study concludes that the main rhetorical device that Kadyrov
employs is a merging of personal and political themes throughout his posts.


2       Data and Methods

2.1     Corpora

Corpus consists of 5000 delfi.lt articles (a random sample from News category of
delfi.lt corpus [12]), 1145 sarmatas.lt articles and 2411 netiesa.lt articles, both pub-
lished in a period of 2014 – 2016 years. Delfi.lt is the mainstream news portal, the most
readable and visited channel in Lithuania, while sarmatas.lt and netiesa.lt are alternative
source of media in selected geographical indication. Sarmatas.lt is one of the most im-
portant sources in terms of dissemination of information (project Research Meadow4,
2014) and netiesa.lt is unconventional but quite popular news portal among Lithuanian
portal readers.


2.2     Methods

    Topic analysis is a Natural Language Processing (NLP) technique that allows auto-
matically extract meaning from texts by identifying recurrent themes or topics. The
goal of the structural topic model is to discover topics and estimate their relationship to
document metadata. LDA is a particularly popular method for fitting a topic model [13].
It treats each document as a mixture of topics and handles each topic as a mixture of
words [13]. This allows documents to "overlap" with content, rather than grouping them
in a way that reflects the normal use of natural language.
    The structural topic model allows researchers to flexibly estimate a topic model that
includes document-level metadata [14]. Estimation is accomplished through a fast var-
iation approximation. In this research the stm package [14] was used, it provides many
useful features, including rich ways to explore topics, estimate uncertainty, and visual-
ize quantities of interest. Structural topic modeling operating principle:
    1. The generative model begins at the top, with document-topic and topic-word
         distributions generating documents that have metadata associated with them;
       • a topic is defined as a mixture over words where each word has a probability
            of belonging to a topic.

4   http://mokslopieva.lt/, last accessed 2020/03/15
4


       •    a document is a mixture over topics, meaning that a single document can be
            composed of multiple topics. As such, the sum of the topic proportions across
            all topics for a document is one, and the sum of the word probabilities for a
            given topic is one.
    2. Topical prevalence refers to how much of a document is associated with a topic
         (described on the left hand side) and topical content refers to the words used
         within a topic (described on the right hand side). Hence metadata that explain
         topical prevalence are referred to as topical prevalence covariates, and variables
         that explain topical content are referred to as topical content covariates
    In this work, the R [15] package stm [14] for structural topic modeling was used.


2.3        Overall Process

    We used the following process for the analysis:

1. corpora were collected from the corresponding portals (not part of this research);
2. corpora were created from the random sample from delfi.lt and selected sarmatas.lt,
   netiesa.lt articles;
3. all texts were lemmatized and lowercased using SpaCy5 Core Lithuania models;
4. stopwords6, numbers, symbols and punctuation marks were removed;
5. documents were represented as bag-of-words (a text is represented as the bag (mul-
   tiset) of words, disregarding grammar and even word order but keeping frequen-
   cies.);
6. low frequency words (5% of the least frequent words in the whole corpora) and 5%
   of words that occurred in all the texts were removed;
7. Latent Dirichlet Allocation (LDA) [16] and stm R function [14] were applied for
   structural topic modelling;
8. results were visualized.


3          Results

Topic modeling is part of a class of text analysis methods that analyze “bags” or groups
of words together—instead of counting them individually–in order to capture how the
meaning of words is dependent upon the broader context in which they are used in
natural language. So foremost investigation was for finding the expected proportions in
the data (see Fig. 1).




5   https://spacy.io/, last accessed 2020/03/15
6   https://github.com/tokenmill/ltlangpack, last accessed 2020/03/15
                                                                                     5




  Fig. 1. Topics by expected proportions in the data.

   After this approach we have to select and take into account the words with the high-
est (raw) probabilities and the highest FREX (Frequency and Exclusivity, i.e., words
that are most frequent and exclusive to the topic) (see Fig. 2).
6




Fig. 2. The words with the highest (raw) probabilities and the highest FREX.

   We apply LDA for delfi.lt and sarmatas.lt dataset. Analysis shows rather different
combination of topics in the portals, see Fig. 3. In delfi.lt (left) orange topics (democ-
racy, traffic accidents, referendum on preventing foreigners from owning land in Lith-
uania, ceasefire negotiations in Ukraine, activities of the state security department of
Lithuania, etc.) prevail, while in sarmatas.lt (right) blue-purple topics (Islam and ter-
rorism, industry, Maidan, taxes, migrants and refugees, etc.) are significant part of con-
tent. Summaries of identified topics (highly probable words) were assigned by experts
after qualitative analysis.




Fig. 3. LDA topic prevalence and distribution in delfi.lt (left) and sarmatas.lt (right).

   Interpretability of topics built by topic modeling is an important issue for researchers
applying this technique. Our investigation showed that higher semantic coherence in-
dicates topics that have more consistent words (more interpretable) while exclusivity
                                                                                             7


measures how exclusive the words are to the topic relative to other topics (e.g. low
values mean topics that are vague and share a lot of words with other topics while high
values indicate words that are very unique/exclusive to the topic) (see Fig. 4).




  Fig. 4. Topic interpretability: the exclusivity and semantic coherence (X axis represents se-
mantic coherence, Y axis – exclusivity).

   After examination of the whole set, we focused on the distribution of topics across
different media channels. The stm is a general framework for topic modeling with doc-
ument-level covariate information. The covariates can improve inference and qualita-
tive interpretability and are allowed to affect topical prevalence, topical content or both.
The software package implements the estimation algorithms for the model and also
includes tools for every stage of a standard workflow from reading in and processing
raw text through making publication quality figures. Topical prevalence refers to how
much of a document is associated with a topic and topical content refers to the words
used within a topic. Expected difference in topic probability be media type (with 95 %
confidence intervals) is shown below (see Fig. 5 and Fig. 6).




Fig. 5. Effect of media Type on Topic Prevalence in 2014-2016.
8




Fig. 6. Effect of media Type on Topic Prevalence in 2014-2016.

   Following examining the distribution of all topics, we focused our research on key
sensitive topics. We find that the model captures important events and differences be-
tween different media channel’ depictions of these events (see Annex 1).
   Topic correlation network creation results are depicted in Fig. 7. The way these al-
gorithms work is by assuming that each document is composed of a mixture of topics,
and then trying to find out how strong a presence each topic has in a given document.
This is done by grouping together the documents based on the words they contain, and
noticing correlations between them. A topic model captures this intuition in a mathe-
matical framework, which allows examining a set of documents and discovering, based
on the statistics of the words in each, what the topics might be and what each docu-
ment's balance of topics is.
                                                                                                9




   Fig. 7. Delfi.lt, Sarmatas.lt and Netiesa.lt Topic (Correlation) Network. Explanation: Blue –
more typical to unconventional media; Red – more typical to mainstream media; Black – topics
that differ delfi.lt (mainstream) and alternative/unconventional (sarmatas.lt and netiesa.lt) news
sources most.

   Topic models have become a standard tool within quantitative text analysis for many
different reasons. Topic models can be much more useful than simple word frequency
or dictionary based approaches depending upon the use case. Topic models tend to pro-
duce the best results when applied to texts that are not too short (e.g. tweets), and those
that have a consistent structure.


4       Conclusion and Future Plans

We discussed an ongoing research of: (1) text analytics and Artificial Intelligence tools
to identify trending topics in different media sources; (2) exploratory visual analytics
10


tools to identify prevalence of topics in different sources & their dynamics. We demon-
strated the applicability of such approach to mainstream Lithuanian news portal
(delfi.lt) and two alternative/unconventional media channels – sarmatas.lt and
netiesa.lt. Early stage analysis shows considerable difference of prevalent topics in dif-
ferent media channels, which allows identifying targets of the channel.
   We plan to extend research to wider set of media sources, change of topics in time
(more detailed) and relations between topics and media channels (more detailed).


References
1.  Heidenreich, T., Lind, F., Eberl, J. M., Boomgaarden, H. G.: Media Framing Dynamics of
    the ‘European Refugee Crisis’. A Comparative Topic Modelling Approach, Journal of Ref-
    ugee Studies, 32(1), i172–i182 (2019), https://doi.org/10.1093/jrs/fez025, last accessed
    2020/03/15.
2. Güven, Z. A., Diri, B., Çakaloğlu, T.: Comparison of Topic Modeling Methods for Type
    Detection of Turkish News. In: 4th International Conference on Computer Science and En-
    gineering (UBMK), pp. 150-154, Samsun, Turkey (2019).
3. Kherwa, P., Bansa,l P.: Topic Modeling: A Comprehensive Review, SIS, EAI (2019), doi:
    10.4108/eai.13-7-2018.159623, last accessed 2020/03/15.
4. Moodley, A., Marivate, V.: Topic Modelling of News Articles for Two Consecutive Elec-
    tions in South Africa. In: 6th International Conference on Soft Computing & Machine In-
    telligence (ISCMI), pp. 131-136, Johannesburg, South Africa (2019).
5. Brookes, G., McEnery, T.: The utility of topic modelling for discourse studies: A critical
    evaluation’. Discourse Studies 21(1), 3–21 (2019), doi: 10.1177/1461445618814032, last
    accessed 2020/03/15.
6. Weiying, K., Pham, D.N., Hai, N.C., Ong, H. H.:Topic Modelling for Malay News Aggre-
    gator. In: Fourth International Conference on Advances in Computing, Communication &
    Automation (ICACCA), pp. 1-6, Subang Jaya, Malaysia (2018).
7. Missier P. et al.: Tracking Dengue Epidemics Using Twitter Content Classification and
    Topic Modelling. In: Casteleyn S., Dolog P., Pautasso C. (eds) Current Trends in Web En-
    gineering, ICWE 2016, Lecture Notes in Computer Science, vol 9881, Springer, Cham
    (2016).
8. Habibabadi, S. K., Haghighi, P. D.: Topic Modelling for Identification of Vaccine Reactions
    in Twitter. In: Proceedings of the Australasian Computer Science Week Multiconference
    (ACSW 2019), Association for Computing Machinery, New York, NY, USA, Article 31,
    1–10 (2019), https://doi.org/10.1145/3290688.3290735, last accessed 2020/03/15.
9. Arslina, A., Liebenlito, M.: Sequential Topic Modelling: A Case Study on Indonesian
    LGBT Conversation on Twitter. In: Prime: Indonesian Journal of Pure and Applied Mathe-
    matics, 1. 10.15408/inprime.v1i1.12726 (2019).
10. Sokolova, M., Huang, K., Matwin, S., Ramisch, J., Sazonova, V., Black, R., Orwa, C.,
    Ochieng, S., Sambuli, N.: Topic Modelling and Event Identification from Twitter Textual
    Data (2016).
11. Rodina E., Dligach D.: Dictator’s Instagram: personal and political narratives in a Chechen
    leader’s social network. Caucasus Survey, 7(2), 95-109 (2019), doi:
    10.1080/23761199.2019.1567145, last accessed 2020/03/15.
                                                                                              11


12. Bielinskienė, A., Boizou, L., Bumbulienė, I., Kovalevskaitė, J., Krilavičius, T., Mandra-
    vickaitė, J., Rimkutė, E., Vilkaitė-Lozdienė, L.: DELFI.lt corpus, Vilnius, Lithuania (2019),
    https://www.clarin.vdu.lt/xmlui/handle/20.500.11821/30, last accessed 2020/03/15.
13. Silge J., Robinson D. Text Mining with R– A Tidy Approach. O'Reilly Media, Sebastopol,
    California, USA (2017).
14. Roberts M. E., Stewart B. M., Tingley D.: Stm: An R Package for Structural Topic Models.
    Journal of Statistical Software 91(2), 1-40 (2019), doi: 10.18637/jss.v091.i0, last accessed
    2020/03/15.
15. R Core Team: R: A language and environment for statistical computing. R Foundation for
    Statistical Computing, Vienna, Austria (2014), http://www.R-project.org/, last accessed
    2020/03/15.
16. Blei, D. M., Lafferty, J. D.: Topic models. In Text mining, pp. 101-124, Chapman and
    Hall/CRC, Boca Raton (2009).
12


     Annex 1
13
14
                                              15




Explanation:
    •    Blue – mainstream media portal;
    •    Red – unconventional media portal;
    •    Line -- expected probabilities;
    •    Dash line – sample median.