<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>P. Tungthamthiti, K. Shirai, M. Mohd, Recognition of sarcasm in microblogging based on
sentiment analysis and coherence identification, J. Nat. Lang. Process.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.5715/jnlp.23.383</article-id>
      <title-group>
        <article-title>using transformer models and automatic parsing⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vysotska</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrian</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hyriak</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lyubomyr Chyrun</string-name>
          <email>lyubomyr.v.chyrun@lpnu.ua</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rostyslav Fedchuk</string-name>
          <email>rostyslav.b.fedchuk@lpnu.ua</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariia</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>D. Serikbayev East Kazakhstan University</institution>
          ,
          <addr-line>D. Serikbayev STR., 19, 070004 Ust-Kamenogorsk</addr-line>
          ,
          <country>The Republic of Kazakhstan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hetman Petro Sahaidachnyi National Army Academy</institution>
          ,
          <addr-line>Heroes of Maidan 32, 79026 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ivan Franko National University</institution>
          ,
          <addr-line>Universytetska Street 1, Lviv, 79000 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>S. Bandera 12, 79013 Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Oleksandr Lavrut</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>West Ukrainian National University</institution>
          ,
          <addr-line>Lvivska Street 11, 46004 Ternopil</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>Yuriy Fedlovyvh Chernivtsi National University</institution>
          ,
          <addr-line>Kotsiubynskoho Street 2, 58012 Chernivtsi</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>23</volume>
      <issue>5</issue>
      <fpage>383</fpage>
      <lpage>405</lpage>
      <abstract>
        <p>The article presents a comprehensive system for the automated analysis of comments on Instagram, focusing on multilingual content and the unique characteristics of social networks. The system includes a module for automatic parsing of dynamic content, an algorithm for determining the language of comments, and mood analysis modules built on the basis of modern transformer models, in particular XLM-RoBERTa. Particular attention is paid to supporting Ukrainian, Russian, and English, as well as processing texts with informal elements, including slang, abbreviations, emojis, and symbols. An approach to analysing mood dynamics over time by combining models of time series, moving averages, and clustering is proposed. The system is complemented by interactive visualisation of results, which enables researchers and businesses to gain in-depth insights from large amounts of data. The analysis of existing solutions demonstrates the advantages of the proposed approach, particularly its high accuracy for local languages and its adaptation to social media content. The developed tool is crucial for monitoring public sentiment, gathering business intelligence, and enhancing information security, particularly in the Ukrainian context.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Sentiment analysis</kwd>
        <kwd>social media</kwd>
        <kwd>Instagram</kwd>
        <kwd>multilingualism</kwd>
        <kwd>transformer models</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>automatic parsing</kwd>
        <kwd>language detection</kwd>
        <kwd>time series</kwd>
        <kwd>natural language processing (NLP)</kwd>
        <kwd>data visualisation</kwd>
        <kwd>emotional analysis</kwd>
        <kwd>comment trends</kwd>
        <kwd>1</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today's world, social media has become a powerful tool for communication, marketing, opinion
analysis, and research into consumer sentiment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Instagram, as one of the most popular
platforms, generates millions of comments daily that contain valuable information for businesses,
academics, NGOs, and governments. However, it is not possible to process such a volume of data
manually, and existing automated solutions have significant limitations, especially in the context of
multilingualism, mood specificity, and local contexts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Most modern solutions for analysing
comments on social networks focus on English-language content, overlooking multilingualism and
      </p>
      <p>
        0000-0001-6417-3689 (V. Vysotska); 0009-0007-4948-4586 (A. Hyriak); 0000-0002-9448-1751 (L. Chyrun);
0009-00026669-0369 (R. Fedchuk); 0000-0002-4909-6723 (O. Lavrut); 0000-0003-4858-4511 (D. Uhryn); 0000-0002-9690-8042
(L. Kolyasa); 0000-0002-8411-3584 (S. Smailova); 0000-0002-1101-7479 (M. Brygadyr)
the distinct characteristics of language groups, such as Ukrainian and Russian [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Problems with
existing solutions [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4–6</xref>
        ]:
      </p>
      <sec id="sec-1-1">
        <title>1. Insufficient support for the Ukrainian language. 2. The difficulty of working with multilingual content. 3. Lack of in-depth trend analysis. 4. Limited in specialised analysis.</title>
        <p>
          In the context of Ukraine, there is a growing need to analyse local content, including Ukrainian
and Russian. Existing solutions lack high accuracy due to the absence of specialised models for
these languages [
          <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
          ]. In real-world conditions, comments often contain text in multiple languages,
symbols, emojis, and abbreviations, making them challenging to analyse [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Existing systems do
not account for the dynamics of mood changes over time, linguistic features, or the solidarity of
comments with the initial post [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Most solutions do not utilise modern advancements in the field
of transformers, particularly specialised models that can provide accurate sentiment analysis for
various languages [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The proposed solution, which includes automatic parsing, multilingual
sentiment analysis using specialised transformer models, and the construction of a comprehensive
trend analysis, is highly relevant to Ukraine for several reasons [
          <xref ref-type="bibr" rid="ref2 ref3 ref4">2–4</xref>
          ]:
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1. Development of the Ukrainian IT sector. 2. Support of public opinion. 3. A tool for business. 4. Promotion of scientific research.</title>
        <p>The creation of specialised models for the Ukrainian language will contribute to the
development of local technologies and increase the competitiveness of Ukrainian companies in the
global market. In times of war and post-conflict reconstruction, sentiment analysis on social media
can be helpful in monitoring public opinion, detecting disinformation, and evaluating the
effectiveness of information campaigns. Businesses will be able to receive more accurate analytics
on reactions to their products or services, taking into account local language features and nuances.
The proposed solution opens up new opportunities for research in the field of natural language
processing, mood analysis and sociology. Thus, the proposed project is not only relevant but also
critically important for the development of text analysis technologies in social networks within a
multilingual environment, particularly in the context of Ukraine. It allows you to address existing
limitations by providing accurate, contextual and multilingual analysis of comments, which is
essential for both business and society as a whole.</p>
        <p>The purpose of the study is to develop a comprehensive system of automated analysis of
comments on Instagram, which provides multilingual text processing, sentiment identification,
trend building and in-depth analysis of the interaction of comments with posts, taking into account
the specifics of language groups and the context of social networks, to increase the efficiency of
decision-making by businesses, researchers and organisations.</p>
        <p>To achieve this goal, it is necessary to solve the following tasks:
1. Development of a mechanism for automatic parsing of comments from Instagram: ensure
efficient retrieval of data from posts, including comments, emojis, and symbols; consider
the technical limitations of the Instagram API and optimise for high scraping performance.
2. Definition of comment language: implement an algorithm to automatically detect the
language of comments, taking into account multilingualism, abbreviations, slang and
symbols, and add labelling of comments (e.g. "ua", "ru", "en", "symbol_only") for further
analysis.
3. Development of a mechanism for analysing the sentiments of comments: integrate
specialised pre-trained transformer models from the Hugging Face platform to analyse the
sentiments of comments in different languages, and to ensure high accuracy of analysis,
adapted to the specifics of texts from social networks.
4. Building a comprehensive analysis of the results: develop algorithms to assess the
solidarity of comments with posts, determine the general mood, and build trends for
maintaining posts over time, and perform an analysis of the sentiment of comments based
on the language group.
5. Visualisation of data and analysis results: develop tools for visualising results in the form
of graphs, charts, and interactive reports, and to provide the possibility of segmented
analysis (by language, time, mood, etc.).
6. Testing and optimisation of the system: test the system on real data from Instagram to
check its performance and accuracy, and optimise the algorithm to ensure efficient
processing of large datasets.</p>
        <p>Thus, the implementation of these tasks will create an innovative solution for the multilingual
analysis of comments in social networks, which will be useful for businesses, researchers, and
organisations, particularly in Ukraine, and will contribute to the development of natural language
processing technologies. The object of this study is the automated processing and analysis of text
data generated by users in social networks, specifically Instagram comments that contain
multilingual content, symbols, emojis, and other text features. The subject of the study is methods
and algorithms for automatic parsing, multilingual text processing, sentiment determination, and
analysis of comment trends in social networks, particularly specialised transformer models for
analysing texts in different languages, as well as approaches to visualising results to provide a deep
understanding of user interaction with content. Within the framework of the study, the following
new scientific provisions and solutions were obtained, which differ from the previously known
ones and have the subsequent degree of novelty:
1. For the first time, an approach to analysing multilingual comments on Instagram has been
developed, taking into account the specifics of the Ukrainian, Russian, and English
languages, as well as content consisting only of characters (symbol_only). An algorithm for
automatically determining the language of comments is proposed, taking into account the
features of texts in social networks, such as slang, abbreviations, emojis and symbols.
2. For the first time, specialised pre-trained transformer models have been integrated to
analyse the sentiments of comments in various languages. Adaptation of models from the
Hugging Face platform for text analysis in the multilingual environment of Instagram, in
particular for the Ukrainian language, which was previously underrepresented in existing
solutions.
3. The process of analysing the solidarity of comments with posts has been improved. A new
approach has been developed to assess the level of support or criticism of comments
regarding the content of the initial post, enabling a more accurate evaluation of user
interaction with the content.
4. The method of complex analysis of moods in the dynamics of time was further developed.</p>
        <p>An algorithm for identifying trends in comment sentiments based on the time of post
publication is proposed, enabling the detection of changes in audience reactions to content
over time.
5. Improved the approach to analysing comment sentiment based on language group. For the
first time, a comparative analysis of the moods of comments in different languages
(Ukrainian, Russian, English, and symbol_only) was conducted, taking into account
linguistic and cultural features.
6. A new approach to visualising the results of analysing multilingual comments has been
developed. Interactive data presentation tools have been developed that enable businesses,
researchers, and organisations to quickly gain insights from large datasets, taking into
account both linguistic and emotional characteristics.</p>
        <p>Thus, the results of the study have a high degree of novelty, as the proposed solutions provide a
more accurate, contextual and multilingual analysis of comments on social networks, which has
not been implemented in this form before, especially for the Ukrainian context.</p>
        <p>
          The developed project has significant practical value, since its implementation opens up new
opportunities for data analysis in social networks, in particular on Instagram, and can be applied in
various areas [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15">10–15</xref>
          ]:
1. Business Analytics and Marketing – monitoring consumer sentiment. Businesses have the
opportunity to analyse audience reactions to their products, services, or advertising
campaigns, taking into account the sentiments of commentators in various languages;
Evaluating the effectiveness of content. The system enables you to determine the level of
solidarity between comments and posts, which helps assess how the content resonates with
the audience; Identifying trends. Sentiment analysis, in the context of time dynamics, helps
businesses predict consumer behaviour and adapt their strategies.
2. Public opinion and sociological research – monitoring of public sentiment. The project
enables you to analyse user reactions to socially significant topics, events, or political
decisions, taking into account the multilingual nature of comments; Identification of
language features. Analysis of moods across different language groups helps to better
understand the cultural and regional characteristics of information perception.
3. Information Security and Countering Disinformation – detection of harmful content. The
system can be used to automatically detect toxic or destructive comments, which is
essential for content moderation; Monitoring of information campaigns. The tool enables
you to evaluate the effectiveness of information campaigns targeting the combat of
disinformation, particularly in the context of Ukraine.
4. Development of local technologies: Support for the Ukrainian language – the creation of
specialised models for analysing texts in Ukrainian contributes to the development of local
natural language processing technologies; Integration into the Ukrainian IT sector.
Ukrainian IT companies can utilise the project to develop new products and services
centred on social media analysis.
5. Academic Studies: Expansion of research in the field of natural language processing – the
integration of pre-trained transformer models for multilingual text analysis opens up new
opportunities for scientific research; Study of social interactions. The project can be used to
analyse interactions between users and content in social networks, which is relevant to
sociology, psychology, and media studies.
6. Tool for state bodies and public organisations – assessment of public support; government
agencies and civil society organisations can utilise the system to gauge the public's
sentiment on key social or political issues; monitoring social trends. The tool enables you
to track changes in societal moods, which is crucial for informed strategic decision-making.
7. Data visualisation for decision-making – interactive reports. The results of the analysis are
presented in a visual form (graphs, diagrams, trends), which simplifies the decision-making
process for businesses, organisations and researchers; Segmented analysis. The ability to
analyse data by language, time, mood, and other factors provides flexibility in working
with large amounts of data.
        </p>
        <p>The developed project is of particular importance to Ukraine, as it contributes to the
development of local IT solutions, supports the Ukrainian language in the technological
environment, helps analyse public sentiment in challenging socio-political conditions, and provides
tools to combat disinformation. Thus, the practical value of the project lies in its versatility,
adaptability to a multilingual environment, ability to generate deep insights from social media data,
and support for the development of technologies focused on local needs.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Comparison of the product being developed with analogues, advantages/disadvantages determining and problem formulation</title>
      <p>
        In today's world, social media, particularly Instagram, has become a powerful source of data for
analysing user sentiment, public opinion, and behaviour [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1–4</xref>
        ]. However, processing such data is a
challenging task due to its multilingual nature, unstructured format, use of symbols, emojis, and
specific slang [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5–9</xref>
        ]. To solve these challenges, it is necessary to create specialised systems that
combine automatic parsing, multilingual text analysis, sentiment determination, and trend building
over time. There are numerous solutions on the market today that focus on text analysis. Still, most
of them focus on specific aspects of the problem, such as data scraping, language definition, or
sentiment analysis [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10–12</xref>
        ]. Complex systems that integrate all these functions are mostly internal
solutions of large companies and are not freely available. It makes it difficult to directly compare
the product under development with analogues, since there are no open systems on the market
that fully comply with the proposed concept. For objective analysis, this section will compare
individual components of the system under development with existing analogues, such as tools for
data parsing (Selenium, Instaloader, Instagram API), methods for determining the language of texts
(LangDetect, FastText), models for sentiment analysis (BERT, RoBERTa, VADER), and approaches
to data visualisation. It will enable you to assess the advantages and disadvantages of each
component, justify the choice of methods and technologies for implementing the system, and
identify any remaining unresolved problems. Thus, the analytical review will aim to compare the
components of the project being developed with existing solutions, critically analysing their
effectiveness and adapting them to the specific tasks facing the project. It will enable you to
identify key tasks and create a comprehensive system that provides accurate and multilingual
analysis of texts from social networks.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Overview of existing systems for automating text analysis in social networks</title>
        <p>
          Social media is a primary source of textual data for businesses, researchers, and organisations
seeking to understand user sentiments, trends, and behaviours [
          <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12–15</xref>
          ]. To automate the analysis of
texts on social networks, several platforms and tools offer various functionalities.
        </p>
        <p>Let's review the most well-known systems, such as Google Cloud Natural Language, Microsoft
Azure Text Analytics, Brandwatch, and Sprout Social, with an emphasis on their functionality,
multilingual capabilities, sentiment analysis accuracy, integration, and visualisation features.
1. Google Cloud Natural Language API is a service from Google that allows you to analyse
text data using artificial intelligence. It supports features such as language detection,
sentiment analysis, entity recognition, and parsing. Determines the polarity (positive,
negative, neutral) and mood intensity of the text. Automatically identifies the language of
the text. Highlights key objects, such as names, places, and organisations. It supports more
than 20 languages, including English and Russian, but does not provide full support for the
Ukrainian language. The accuracy of sentiment analysis is high for English-language
content; however, it decreases for other languages, particularly for texts from social
networks that contain slang, emojis, and abbreviations. The API integrates seamlessly into
applications thanks to REST APIs and SDKs for different programming languages. It does
not provide built-in visualisation tools, but the results can be integrated with other
graphing tools. Limited support for the Ukrainian language. High cost for large amounts of
data. Google Cloud Natural Language is a powerful tool for text analysis; however, its
limitations in local languages and the specific characteristics of texts from social networks
reduce its effectiveness for multilingual analysis.
2. Microsoft Azure Text Analytics is a part of Azure Cognitive Services that provides an API
for text analysis. It supports sentiment detection, entity recognition, text classification, and
language detection. Determines the emotional tone of the text. Highlights entities and
categorises them. Automatically detects the language of the text. Supports more than 120
languages, including Ukrainian, Russian, and English. The accuracy of sentiment analysis is
high for major languages; however, it decreases for texts from social networks, particularly
those containing symbols and emojis. Easy integration via REST API and SDK for different
programming languages. It does not have built-in visualisation tools, but the results can be
used in other tools for visual representation. High cost for large amounts of data. Limited
adaptation to the specifics of social media texts. Microsoft Azure Text Analytics has
broader language support than Google Cloud Natural Language, but its accuracy for texts
from social networks remains limited.
3. Brandwatch is a social media monitoring platform that allows you to collect data, analyse
sentiment, identify trends, and generate reports. Collects data from various platforms,
including Instagram, Twitter, and Facebook. Determining the mood of texts in social
networks. Automatic identification of key trends in texts. Supports text analysis in multiple
languages, with varying accuracy levels by language. The accuracy of sentiment analysis is
high for English-language content, but may be reduced for less popular languages such as
Ukrainian. The platform offers APIs for integration with other systems. It has built-in tools
for creating graphs, charts, and reports. High cost. Closed access to sentiment analysis
algorithms. Brandwatch is an effective tool for monitoring social networks, but its
limitations in supporting local languages and high cost make it less accessible for
multilingual analysis.
4. Sprout Social is a social media management platform that also includes text analysis
features. Data collection from different platforms. Determination of the moods of texts.
Automatically generate reports on user interactions. Supports text analysis in multiple
languages, but is limited. The accuracy of sentiment analysis is high for English-language
content, but limited for other languages. Integration with other platforms via API. It has
built-in visualisation tools for generating reports. High cost. Limited support for local
languages. Sprout Social is useful for social media management, but its text analysis
functionality is basic.</p>
        <p>Existing social media text analysis systems offer a wide range of features; however, they have
limitations, including support for local languages (such as Ukrainian), the accuracy of text analysis
from social networks, and adaptation to the specifics of multilingual content. The system under
development aims to solve these problems by integrating specialised models for sentiment analysis,
multilingual analysis and adaptation to texts from social networks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Comparison of data collection methods</title>
        <p>
          Collecting data from social networks, particularly Instagram, is a crucial step in further analysing
comments, sentiments, and trends [
          <xref ref-type="bibr" rid="ref15">15–18</xref>
          ]. However, the choice of scraping tool depends on
several factors, including the data structure, platform limitations, the tool's performance, and
compliance with privacy policies. Let's compare the most common tools for scraping data from
Instagram: Selenium, Scrapy, BeautifulSoup, Instaloader, and Instagram Graph API. Particular
attention is paid to the advantages of Selenium for dynamic comment parsing.
        </p>
        <p>1. Selenium is a browser automation tool that allows you to interact with web pages the way
a real user does. It is ideal for dynamic parsing because it can interact with page elements
generated by JavaScript. Selenium allows you to load comments that appear after clicking
the "Load more comments" button. It is possible to automate any actions on the page,
including scrolling, clicking buttons, and entering data. Using a real-time browser reduces
the risk of detecting automated actions – the ability to work with Chrome, Firefox, Edge
and other browsers. Selenium consumes a lot of RAM and computing resources due to page
rendering in the browser. Less productive compared to tools that work without a browser.
Instagram may detect automated activities, so additional measures, such as using proxies,
are required. Selenium is a powerful tool for dynamic scraping, particularly for handling
complex pages with dynamically loaded elements, such as those found on Instagram
comments.
2. Scrapy is a Python web scraping framework that allows you to quickly scrape data from
static web pages. Scrapy is much faster than Selenium because it doesn't use a browser.
Suitable for collecting large amounts of data from simple static pages. Easily integrates
with other libraries for data processing. Scrapy cannot work with JavaScript-generated
elements, such as dynamically loaded comments. To work with Instagram, you need
additional tools or workarounds. Scrapy is effective for static pages, but its limitations in
working with dynamic content make it less suitable for scraping comments from social
media platforms like Instagram.
3. BeautifulSoup is a Python library for analysing HTML and XML documents. It allows you
to extract data from web pages using a simple API. Easy to set up for basic parsing. It can
be used to work with HTML structures of any complexity. It can be combined with other
tools such as Scrapy or Selenium. BeautifulSoup cannot work with dynamically uploaded
content. Works slower than Scrapy, especially for large amounts of data. BeautifulSoup is
suitable for basic HTML page scraping, but it cannot work with dynamic content, such as
Instagram comments.
4. Instaloader is a Python tool for downloading data from Instagram, including posts, profiles,
and comments. Easily configurable for basic data collection from public profiles. Supports
logging in to the account to access private data. Can extract comments on posts. Using
Instaloader may violate Instagram's terms of service, resulting in account suspension or
ban. The tool is subject to Instagram's restrictions, which may affect the stability of the
work. Cannot interact with elements that JavaScript generates. Instaloader is helpful for
basic data collection from Instagram, but its limitations in working with dynamic content
make it less effective for scraping comments.
5. Instagram Graph API is an official tool for accessing Instagram data that provides features
to collect information about posts, comments, and other metadata. The API complies with
Instagram's privacy policy. Allows access to structured data with high accuracy. You can
obtain additional information, such as the publication date, the author of the comment, and
so on. The API only works with business accounts and creator accounts. There are strict
limits on the number of requests, which makes it challenging to work with large amounts
of data. The data is only available for public profiles. The Instagram Graph API is a reliable
and legal tool, but its limitations make it less flexible for comprehensive comment scraping.</p>
        <p>Among all the tools reviewed, Selenium is the best choice for scraping comments from
Instagram, as it enables working with dynamically loaded elements, such as the "Load more
comments" button. Although Selenium has high resource requirements and is slower, its flexibility
and ability to mimic user actions make it ideal for tasks that involve dynamic content. Other tools,
such as Scrapy, BeautifulSoup, and Instaloader, are less suitable for this task due to limitations in
working with JavaScript-generated elements. At the same time, the Instagram Graph API has strict
access restrictions.</p>
        <sec id="sec-2-2-1">
          <title>Flexibility</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>Risk of blocking</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>High</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>Average Average</title>
        </sec>
        <sec id="sec-2-2-5">
          <title>Medium Low Low</title>
        </sec>
        <sec id="sec-2-2-6">
          <title>Legality</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>Partially</title>
        </sec>
        <sec id="sec-2-2-8">
          <title>Partially</title>
        </sec>
        <sec id="sec-2-2-9">
          <title>Instagram Graph API No No</title>
        </sec>
        <sec id="sec-2-2-10">
          <title>Average</title>
        </sec>
        <sec id="sec-2-2-11">
          <title>High</title>
          <p>Low
Low
2.3. Comparison of methods for determining the language of texts
Determining the language of a text is a key step for multilingual data analysis on social media. In
the context of Instagram, comments can be written in different languages, contain mixed language
elements, symbols, emojis, and abbreviations, making it difficult to automatically detect the
language [18–24]. We will review the most common algorithms and libraries for determining the
language of texts, such as LangDetect, FastText, and DeepLang, and their adaptation to the specific
needs of texts from social networks. Particular attention is paid to accuracy, performance and the
possibilities of integrating these methods into the system being developed.</p>
          <p>1. LangDetect is a popular text language detection library based on an algorithm that uses
statistical models and Bayesian classification. It supports more than 50 languages. Supports
most common languages, including Ukrainian, Russian, and English. Easily integrates into
Python projects – high speed of text processing, which allows you to work with large
amounts of data. LangDetect has difficulty detecting the language of short texts, such as
social media comments. Texts that contain multiple languages or characters may be
classified incorrectly. LangDetect is effective for detecting the language of long texts, but
its accuracy for short and mixed texts is limited.
2. FastText is a text classification and vectorisation library developed by Facebook AI. It also
features a model for detecting the language of the text, which supports over 170 languages.
FastText demonstrates high accuracy even for short texts. Supports a greater number of
languages than LangDetect. FastText models are designed to be fast, making them
wellsuited for processing large amounts of data. The model can be additionally trained on
specific data from social networks. To use FastText, you must have a sufficient amount of
RAM. The integration can be more complicated compared to LangDetect. FastText is one of
the best options for determining the language of short texts such as Instagram comments,
due to its accuracy and adaptability.
3. DeepLang is a text-language detection library that uses neural networks for classification.</p>
          <p>It supports more than 100 languages and is one of the most advanced technologies in this
field. The use of neural networks ensures high accuracy even for short and mixed texts.
Can work with texts that contain symbols, emojis, and abbreviations. You can additionally
train the model on specific data. DeepLang requires significant computing resources to
operate. Integration into the system requires additional effort due to the complexity of
working with neural networks. DeepLang is the best option for accurately determining the
language of texts from social networks; however, it requires significant resources to utilise.</p>
        </sec>
        <sec id="sec-2-2-12">
          <title>Adaptation to mixed texts Low</title>
        </sec>
        <sec id="sec-2-2-13">
          <title>Average High</title>
        </sec>
        <sec id="sec-2-2-14">
          <title>Speed</title>
        </sec>
        <sec id="sec-2-2-15">
          <title>Resources</title>
        </sec>
        <sec id="sec-2-2-16">
          <title>High</title>
        </sec>
        <sec id="sec-2-2-17">
          <title>High</title>
          <p>Low</p>
        </sec>
        <sec id="sec-2-2-18">
          <title>Average</title>
          <p>For a system that handles comments from Instagram, LangDetect is the best choice. However,
FastText offers high accuracy for short texts, broad language support, and the ability to adapt to
the specifics of texts from social networks. DeepLang is also a promising option for providing the
highest accuracy, but its resource requirements can be a limiting factor. LangDetect can be used for
basic tasks; it does not require significant resource expenditures on the part of the system, so it was
chosen for the project.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Comparison of Sentiment Analysis Models</title>
        <p>
          Analysing the sentiments of texts is a crucial task for understanding the emotional reaction of the
audience to specific content on social networks [
          <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref15 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1–24</xref>
          ]. This section discusses modern approaches
to sentiment analysis, including traditional vocabulary methods and modern transformer-based
models. Particular attention is paid to their accuracy for multilingual text analysis, including
Ukrainian, Russian and English. The choice of models from the Hugging Face library for
implementing the system under development will also be justified.
        </p>
        <p>1. Traditional dictionary approaches to sentiment analysis are based on the use of pre-created
dictionaries that contain words with the meanings of their polarity (positive, negative,
neutral). The most common tool in this category is VADER (Valence Aware Dictionary and
sEntiment Reasoner). Advantages of VADER: ease of use, high accuracy for
Englishlanguage text, particularly short texts such as tweets or comments and taking into account
the intensity of the mood through punctuation, capital letters and emojis. Disadvantages of
VADER:
1.1. Limited language support, i.e., VADER is primarily focused on the English language.
1.2. Low accuracy for multilingual analysis or texts from social networks that contain slang,
abbreviations, and mixed languages.
1.3. Difficulties with context arise when dictionary methods fail to take into account the
context, which can lead to errors in determining mood.</p>
        <p>VADER is effective for fundamental sentiment analysis of English-language content; however,
its limitations in multilingualism and contextuality render it unsuitable for complex multilingual
tasks.</p>
        <p>2. Transformers are modern models for natural language processing (NLP) that use a
selfattention mechanism to take into account the context of each word in the text. They
provide high accuracy for sentiment analysis, even for multilingual content.
2.1. BERT (Bidirectional Encoder Representations from Transformers) is a bidirectional model
that takes into account the context of words both to the left and right of the current word.
Advantages: high accuracy for sentiment analysis due to contextual consideration and the
ability to further train on specific data, which increases accuracy for particular tasks.
2.2. RoBERTa (A Robustly Optimised BERT Pretraining Approach) is an advanced version of
BERT that uses better pre-training techniques and larger amounts of data – advantages:
higher accuracy compared to BERT, as well as the ability to adapt to specific tasks. The
disadvantage is that, like BERT, RoBERTa is based on English-language data in the basic
version. RoBERTa is effective for English-language sentiment analysis, but its limitations in
multilingualism remain.
2.3. DistilBERT is a simplified version of BERT that provides faster operation and lower
resource requirements. Advantages: lower requirements for computing resources and
speedier word processing. Disadvantages: reduced accuracy compared to BERT and limited
language support. DistilBERT is a compromise between accuracy and speed, but its
limitations in multilingualism remain.
2.4. XLM-RoBERTa (Cross-lingual RoBERTa) is a multilingual version of RoBERTa that
supports more than 100 languages, including Ukrainian, Russian, and English. Advantages:</p>
        <sec id="sec-2-3-1">
          <title>Accuracy</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Multilingualism</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>Resources</title>
        </sec>
        <sec id="sec-2-3-4">
          <title>Contextuality Low Low Low</title>
          <p>Low</p>
        </sec>
        <sec id="sec-2-3-5">
          <title>High</title>
          <p>Low</p>
        </sec>
        <sec id="sec-2-3-6">
          <title>High</title>
        </sec>
        <sec id="sec-2-3-7">
          <title>High</title>
        </sec>
        <sec id="sec-2-3-8">
          <title>Average</title>
        </sec>
        <sec id="sec-2-3-9">
          <title>High</title>
          <p>Low</p>
        </sec>
        <sec id="sec-2-3-10">
          <title>High</title>
        </sec>
        <sec id="sec-2-3-11">
          <title>High</title>
        </sec>
        <sec id="sec-2-3-12">
          <title>High</title>
        </sec>
        <sec id="sec-2-3-13">
          <title>High</title>
          <p>High accuracy for multilingual sentiment analysis. Adaptation to texts in different
languages, including less common ones. The possibility of additional training on specific
data.</p>
          <p>The disadvantage is the high requirements for computing resources. XLM-RoBERTa is the best
choice for multilingual sentiment analysis because it achieves high accuracy across texts in various
languages.</p>
          <p>Among the sentiment analysis models considered, XLM-RoBERTa is one of the best choices for
multilingual text analysis on social networks due to its accuracy, contextuality, and support for
multiple languages. Traditional dictionary approaches, such as VADER, are less effective due to
their limited multilingual capabilities and lack of contextual consideration. Transformer-based
models, such as BERT, RoBERTa, and DistilBERT, demonstrate high accuracy for English-language
content; however, their limitations in multilingualism make them less suitable for the system under
development. That is why it was decided to use an individual approach, employing different models
for various languages, as well as for symbols and emojis. In the future, with the development of
models, there will be a complete transition to specialised models for each language.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Comparison of approaches to analysing sentiment trends over time</title>
        <p>
          Analysing mood trends over time is a crucial task for understanding the dynamics of changes in
users' emotional responses to content on social networks [
          <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref13 ref14 ref15 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">1–24</xref>
          ]. This approach enables you to
assess how the perception of posts evolves, identify peaks in activity, and discern long-term trends.
This section discusses the primary methods for analysing sentiment trends over time, including
time series analysis, moving averages, and sentiment clustering. It compares existing approaches
with those proposed in the system under development.
        </p>
        <p>1. Time series is a method of analysing data that changes over time, using forecasting and
trend detection models. In the context of sentiment analysis, time series enable you to
estimate the change in sentiment of comments on posts during specific time periods,
offering advantages such as forecasting, trend detection, and flexibility. Time series models
such as ARIMA (AutoRegressive Integrated Moving Average) allow you to predict future
mood changes. Will enable you to identify long-term sentiment trends. Suitable for
analysing data with different time intervals (hours, days, weeks). Disadvantages: data
requirements and complexity of setup. Practical analysis requires large amounts of data that
have a regular time structure. Time series models require careful optimisation of their
parameters. Time series are a powerful method for analysing sentiment trends over time;
however, their effectiveness depends on the quality and regularity of the data.
2. Moving averages are a method that uses averaging sentiment values over a specific time
period to smooth out short-term fluctuations and identify general trends. Advantages:
simplicity, data smoothing and visualisation. Easy to implement and customise. Allows you
to reduce the impact of noise and short-term fluctuations. Moving averages integrate
seamlessly with charting tools to build trends – disadvantages include the loss of parts and
limited predictive ability. Anti-aliasing can hide significant short-term mood changes.
Moving averages do not allow you to predict future mood changes. Moving averages are a
simple and effective method for identifying general trends, but their limitations in
forecasting make them less suitable for complex tasks.
3. Mood clustering is a method of grouping comments according to similar emotional
characteristics, followed by an analysis of their dynamics over time. Advantages: pattern
detection, relationship analysis, and flexibility. Allows you to identify comment groups with
similar sentiments and their changes over time. You can identify relationships between
sentiment and other factors (for example, when posts are posted). Various clustering
algorithms, such as K-Means, DBSCAN, or hierarchical clustering, can be used.
Disadvantages are complexity and resource requirements. Clustering requires careful
adjustment of parameters, such as the number of clusters. Analysing large amounts of data
can be a resource-intensive process. Sentiment clustering is an effective method for
analysing sentiment dynamics in depth; however, its complexity can be a limiting factor.</p>
        <p>In the system being developed, a combined approach will be used to analyse mood trends over
time:</p>
        <sec id="sec-2-4-1">
          <title>1. Time series to predict mood changes based on historical data.</title>
          <p>2. Moving averages are used to smooth out short-term fluctuations and identify general
trends.
3. Clustering to group comments by similar emotional characteristics and analyse their
relationships over time.</p>
          <p>This approach allows you to consider the benefits of each method, providing an accurate and
indepth analysis of moods over time.</p>
          <p>Analysing mood trends over time is a crucial task for understanding the dynamics of the
audience's emotional responses. Among the methods considered, time series provide the best
predictive ability, while moving averages allow data to be smoothed to identify general trends, and
clustering adds the ability to analyse the relationships between sentiment and other factors in
depth. The proposed combined approach in the system under development enables the integration
of the advantages of all methods, providing an accurate analysis of mood changes over time.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.6. Formulation of the problem and justification of the need to develop a system</title>
        <p>In today's world, social media, including Instagram, has become a crucial source of data for
analysing audience sentiments, trends, and behaviours. User comments contain valuable
information that can be used for business intelligence, sociological research, and public opinion
monitoring. However, existing solutions for automating text analysis in social networks have
significant drawbacks that limit their effectiveness, especially for multilingual content and local
languages such as Ukrainian. This section summarises the main problems that arise during the
analysis of texts from social networks, substantiates the need for a new system, and determines
how the proposed approach is superior to its analogues. Disadvantages of existing solutions:
1. Low accuracy for local languages. Most modern text analysis systems, such as Google
Cloud Natural Language and Microsoft Azure Text Analytics, are focused on
Englishlanguage content. The Ukrainian language is often underrepresented in educational
datasets, resulting in low accuracy in mood analysis and language definition. Many models
overlook the slang, abbreviations, and specific grammar of local languages, which
compromises the quality of the analysis.
2. Limited support for multilingualism. Existing systems have limited support for multilingual
content, especially for texts that contain mixed languages (e.g., Ukrainian and English).
Multilingual models, such as mBERT, often exhibit reduced efficiency for less common
languages, including English.
3. The complexity of working with the texts of social networks. Texts from social networks
have specific features, such as slang, abbreviations, emojis, and symbols, that are not taken
into account by many existing models. Dynamic content, such as Instagram comments,
uploaded via JavaScript, creates technical challenges for data collection. The lack of
adaptation to short texts, which are often found in comments, reduces the accuracy of the
analysis.
4. Lack of integration. Most systems do not offer a comprehensive approach that includes
parsing, sentiment analysis, language group definition, and trending over time. Existing
solutions often require the integration of multiple tools, making them difficult to use.</p>
        <p>Based on the analysis of the shortcomings of existing solutions, the following key problems are
formulated, which are solved by the developed system:
1. Improving accuracy for local languages. The system under development utilises modern
transformer models, such as XLM-RoBERTa, which offer high accuracy for Ukrainian,
Russian, and English. Adding models to specific data from social networks enables you to
account for the peculiarities of slang, abbreviations, and symbols.
2. Expanding support for multilingualism. The system provides analysis of mixed-language
texts, including determining the language group for each comment. The integration of
multilingual models enables you to work effectively with texts in multiple languages.
3. Adaptation to texts from social networks. The system takes into account the specific
characteristics of texts from social networks, including emojis, symbols, and abbreviated
text forms. The use of specialised algorithms to collect data from dynamic content
(Selenium) provides access to comments that are loaded via JavaScript.
4. Integrated approach. The system integrates all the key stages of analysis, including parsing
comments, determining language, analysing moods, building trends over time, and
visualising results. The possibility of interactive data analysis through visualisation (Plotly)
is provided.</p>
        <p>Justification of the need to develop the system:
1. The uniqueness of the proposed approach. Existing solutions focus on specific aspects of
text analysis, such as parsing or sentiment analysis, but do not offer a comprehensive
approach. The system under development integrates all stages of analysis, which provides
a complete cycle of data processing from social networks.
2. Advantages over analogues: accuracy, flexibility, interactivity and comprehensiveness. The
use of XLM-RoBERTa enables you to achieve high accuracy for multilingual content,
including Ukrainian. The system is adapted to texts from social networks, taking into
account their specifics. Data visualisation through Plotly enables users to interact with the
analysis results. The system under development encompasses all stages of data analysis
within a single product.
3. Importance for Ukraine. The development of a system that takes into account the specific
characteristics of the Ukrainian language contributes to the advancement of local
technologies in the field of natural language processing (NLP). The system can be used to
monitor public sentiment, assess the effectiveness of information campaigns, and combat
disinformation.</p>
        <p>Existing social media text analysis solutions have significant drawbacks that limit their
effectiveness for multilingual content and local languages. The system under development
addresses these problems by integrating modern models for sentiment analysis, multilingual text
processing, and adaptation to the specifics of social networks, while providing an integrated
approach to data analysis. Its implementation will contribute to improving the accuracy of analysis,
expanding support for local languages, and providing interactive data analysis, making it an
important tool for businesses, researchers, and organisations, especially in the context of Ukraine.</p>
        <p>The analytical review revealed that existing solutions for analysing texts from social networks
have significant limitations, making it difficult to utilise them effectively for multilingual content,
particularly for local languages such as Ukrainian. The main drawbacks include low accuracy of
sentiment analysis for less common languages, limited support for texts from social networks,
difficulty working with dynamic content, and a lack of a comprehensive approach to data analysis.
The considered tools and methods, such as Google Cloud Natural Language, Microsoft Azure Text
Analytics, VADER, XLM-RoBERTa, Selenium, and Plotly, demonstrate strengths in certain aspects
of the analysis but do not provide a comprehensive data processing cycle from social networks.
They are either focused on English-language content or do not take into account the specifics of
texts from social networks, such as slang, emojis, and mixed languages. The developed system has
several significant advantages over existing analogues, including multilingualism, adaptation to
texts from social networks, comprehensiveness, interactivity, and flexibility. The use of the modern
XLM-RoBERTa model yields high accuracy in sentiment analysis for Ukrainian, Russian, and
English, taking into account the characteristics of local language groups. The system takes into
account the specific characteristics of texts from social networks, such as slang, abbreviations,
emojis, and symbols, which enhances the quality of analysis. Integration of all key stages of
analysis – parsing comments, determining language, sentiment analysis, building trends over time,
and data visualisation – within one product. Using Plotly to visualise results enables interaction
with the data, making it easier to analyse and interpret. The system can be further trained on
specific data from social networks, which allows it to be adapted to new challenges and tasks.</p>
        <p>The developed system addresses the key problems that arise during the analysis of texts from
social networks, offering multilingualism, adaptation to local languages, and an integrated
approach to data processing. Its implementation will contribute to increasing the efficiency of text
analysis, the development of local technologies, and the creation of new opportunities for
businesses, researchers, and organisations.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System analysis of the product under development</title>
      <p>System analysis is a crucial stage in the development of information systems, as it enables a
thorough exploration of the subject area, identification of key system requirements, and
justification of its architecture and structure. In the context of developing a system for automating
the analysis of comments from social networks, system analysis provides an understanding of the
complex, multi-level interaction between the system's components, as well as determining the
optimal solutions for addressing the tasks. Social networks, such as Instagram, generate vast
amounts of text data that contain valuable insights into user sentiment, trends, and behaviour.
However, analysing this data is a challenge due to multilingualism, the use of slang, emojis,
symbols, and the dynamic nature of the content. Developing a system that automates the analysis
of such data requires a systematic approach that takes into account all aspects, from data collection
to processing, analysis, and visualisation. The main goals of the system development include:
1. Automation of data collection from social networks – ensuring effective scraping of
comments from dynamic Instagram content.
2. Multilingual analysis of texts – determination of the language group of comments, analysis
of moods and taking into account the specifics of local languages (Ukrainian, Russian,
English).
3. Building sentiment trends over time – identifying changes in user sentiment depending on
time and other factors.
4. Interactive visualisation – creating graphs and reports that allow users to interact with the
results of the analysis.</p>
      <p>System analysis will not only allow us to determine the main components of the system and
their interactions, but also to justify the choice of architecture and implementation methods. It will
ensure the creation of an efficient, reliable, and scalable system that addresses the current
challenges of analysing texts in social networks.</p>
      <p>Social media platforms like Instagram are platforms where users actively interact through text,
images, videos, and comments. Instagram allows users to publish posts that other users can
comment on and react to with likes and emojis. Comment texts on social networks have several
essential characteristics that affect their analysis: content dynamism, multilingualism and a large
amount of data. Comments are constantly changing, new ones are added, and old ones can be
deleted. Instagram is a global platform, allowing comments to be written in a variety of languages,
including English, Ukrainian, Russian, and others. Popular posts can garner thousands of
comments, which creates challenges in processing large amounts of information. Multilingual
content on Instagram often features comments written in different languages within the same post.
It creates challenges for analysis, such as speech recognition (automatic detection of the language
of the comment), processing of texts with different grammatical structures (each language has its
own rules for constructing sentences that must be taken into account), translation and
normalisation (for sentiment analysis, comments can be translated or brought to a single format).
The dynamism of comments means that the data is constantly changing. It creates challenges such
as the timeline and the relevance of the data. Comments are added in real-time, which is essential
for analysing mood changes. Old comments may become irrelevant over time, so the system must
take this into account. Social media comment texts often feature several characteristics, including
slang and abbreviations, emojis, spelling mistakes, and short phrases. Users frequently use informal
words, abbreviations, and acronyms, such as "OMG" and "LOL". Emojis are an integral part of
communication on Instagram, as they effectively convey emotions and moods. Comments often
contain errors due to the speed of writing or the informality of the platform. The comments are
typically brief, making it challenging to comprehensively analyse the context.</p>
      <p>The general goal of the system is to develop an automated system for analysing texts in
Instagram comments to determine user sentiments, taking into account multilingualism, the
dynamism of content, and the specificity of texts. The primary objectives are data collection, text
processing, mood analysis, and visualisation of results. Data collection involves parsing comments,
processing dynamic data, and utilising web scraping. Text processing encompasses normalisation,
pre-processing, and processing of multilingual content. Sentiment analysis involves pinpointing
sentiments, analysing emojis, and utilising Transformer models. Visualising the results consists of
distributing sentiments by language, displaying mood changes in graphs, and filtering data. The
results of achieving the objectives will define the criteria for the system's functioning, including
relevance, accuracy, speed, scalability, and versatility.</p>
      <p>To develop a system for analysing texts in Instagram comments, several alternative options are
being considered. These options included a selection of models for sentiment analysis, tools for
parsing comments, and methods for processing multilingual content. After analysing the
advantages and disadvantages of each option, it was decided to use the transformer models for
sentiment analysis, since they provide high accuracy and take into account the context of the texts.
Alternative options for sentiment analysis:
1. Rule-based models use predefined lexicons (for example, dictionaries of positive and
negative words). Advantages: ease of implementation, does not require large amounts of
data for training. Disadvantages include low accuracy for texts containing slang, emojis,
and abbreviations.
2. Machine learning-based models use algorithms such as Logistic Regression, Random
Forest, or SVM. Advantages: high accuracy for texts with well-defined features.</p>
      <p>Disadvantages: require large amounts of data for training.
3. Deep Learning-based models use neural networks such as LSTM, GRU, or transformers
(e.g., BERT, RoBERTa). Advantages: high accuracy for complex texts, takes into account the
context. Disadvantages include high computational complexity and the need for significant
resources for training.</p>
      <p>For sentiment analysis, transformer models (e.g., BERT, RoBERTa, or XLM-R) were chosen, as
they provide the highest accuracy for complex texts, take context into account, and support
multilingualism. It is vital for analysing texts on Instagram, where comments may contain slang,
emojis, and abbreviations. Alternative options for parsing comments:
1. Instagram API – Using the official API to fetch data. Advantages: reliability, access to
upto-date data. Disadvantages include restrictions on the number of requests and the need for
authorisation.
2. Web Scraping – using libraries for parsing HTML (for example, BeautifulSoup).</p>
      <p>Advantages: Ability to bypass API restrictions. Disadvantages: Risk of violating Instagram's
terms of service and privacy policy.</p>
      <p>Web Scraping was chosen for parsing comments because it provides reliable access to
up-todate data. In the event of API restrictions, alternative approaches can be used as a fallback.</p>
      <p>Alternative options for handling multilingual content:
1. Automatic language detection – using libraries such as langdetect or fastText. Advantages:
speed and accuracy of language detection. Disadvantages: possible errors for short texts.
2. Multilingual models for sentiment analysis – using transformers such as mBERT or XLM-R.</p>
      <p>Advantages: Support for multiple languages, taking context into account. Disadvantages:
high computational complexity.</p>
      <p>Multilingual transformer models (e.g., mBERT or XLM-R) were chosen to handle multilingual
content because they support text analysis in multiple languages and take into account context,
which is crucial for multilingual content on Instagram.</p>
      <p>To select the best option, the analytical hierarchy process (AHP) method was employed. A
comparison of alternative options was carried out according to the following criteria:</p>
      <sec id="sec-3-1">
        <title>1. Accuracy (how accurately the model determines the mood of the text). 2. Computational complexity (resources required to process data). 3. Flexibility (the ability to adapt to multilingual content). 4. Data availability (ease of obtaining data for analysis).</title>
        <p>Based on the scores obtained using these criteria, it was decided to utilise transformer models
for sentiment analysis, Instagram APIs for parsing comments, and multilingual models for
processing multilingual content. Thus, the systematic analysis of the object of study and the subject
area made it possible to consider all the features of the texts in Instagram comments and select the
best options for implementing the system. The system under development is designed to automate
the process of analysing comments on social media platforms, particularly on Instagram. Its main
functions are:</p>
        <p>1. Comment Scraping – Automatically collect comments from social media posts, including
dynamic content that is loaded via JavaScript. Parsing is provided using the Selenium tool.
Also, collecting metadata such as the time the comment was published, the author, and the
related post.
2. Language Definition – Automatically detects the language group of each comment (for
example, Ukrainian, Russian, English, or mixed languages). Also, marking comments that
consist only of symbols or emojis.
3. Sentiment analysis – the use of modern transformer models, such as XLM-RoBERTa, to
determine the polarity of moods (positive, negative, neutral). Additionally, adapting models
to the specificities of texts from social networks, including slang, abbreviations, and emojis.
4. Trend building – analysing mood changes over time, creating time series, and identifying
long-term trends. Additionally, the construction of graphs that reflect the dynamics of
moods over time, language groups, or other factors.
5. Interactive visualisation – using interactive graphs (based on Plotly) to present analysis
results that allow users to interact with data, scale graphs, and highlight key segments.</p>
        <p>The system is a universal tool for multilingual text analysis, enabling the automation of data
collection, processing, and presentation. The system has a wide range of potential users, including:
1. Businesses, in particular, analyse consumer reactions to products, services, or marketing
campaigns, as well as identifying trends in customer feedback to make informed decisions.
2. Researchers, such as sociologists, linguists, and analysts, can use the system to study public
opinion, audience behaviour, and language characteristics, as well as monitor changes in
the population's mood on socially significant topics.
3. Organisations, such as civil society organisations, can utilise the system to monitor public
sentiment, evaluate the effectiveness of information campaigns, and detect disinformation.
Government agencies can also use the system to analyse public opinion on political/social
initiatives.</p>
        <p>The system is helpful for any organisation that works with large amounts of text data on social
media platforms and requires automated analysis. Social media text analysis faces several
challenges that limit the effectiveness of existing solutions, including low accuracy for local
languages, limited support for multilingualism, the complexity of working with social media texts,
and a lack of integrated solutions. Most commercial systems, such as Google Cloud Natural
Language and Microsoft Azure Text Analytics, are primarily focused on English-language content
and exhibit lower accuracy for Ukrainian and Russian. Existing solutions do not cope well with
texts that contain mixed languages, slang and symbols. Features of texts from social networks, such
as abbreviations, emojis, and dynamic content, are often overlooked. Most systems focus on
specific aspects of analysis (for example, only parsing or only sentiment analysis), which makes it
challenging to use them for complex analysis. The system under development addresses these
problems by integrating modern technologies, including transformers (XLM-RoBERTa),
multilingual models for language detection (FastText), and tools for interactive visualisation
(Plotly). Expected effects from the implementation of the system: improving the accuracy of
sentiment analysis, supporting multilingualism, process automation, interactivity, and the
development of local technologies. The use of modern models for multilingual analysis provides
high accuracy even for local languages (Ukrainian, Russian). The system supports text analysis in
different languages, including mixed languages, making it versatile for working in multilingual
environments. Automatic parsing of comments and sentiment analysis significantly reduces labour
costs for processing large amounts of data. Interactive graphs allow users to easily analyse the
results and gain deeper insights. The system contributes to the development of natural language
processing (NLP) technologies for the Ukrainian language, which is essential in the context of
Ukraine. Conceptual model of the system:
1. Input: Comments (text data collected from Instagram, including mixed languages, symbols,
and emojis) and metadata (time the comment was published, author, related post).
2. Initial data: the results of sentiment analysis (the polarity of each comment as positive,
negative, or neutral), trends (graphs of mood changes over time) and reports (interactive
reports including sentiment analysis, frequency of language groups, time trends).
3. Functions and structure of the system: parsing module (automatic collection of data from
Instagram), language detection module (classification of the language group of each
comment), mood analysis module (determination of the emotional polarity of texts) and
visualisation module (construction of interactive graphs and reports).
4. System requirements: performance (fast processing of large amounts of data), accuracy
(high accuracy of sentiment analysis for multilingual content), relevance (correspondence
of data information in real time), and scalability (ability to process data from different
sources, not only Instagram).</p>
        <p>The system being developed is designed to automate the analysis of comments on social media
platforms, particularly Instagram. Its implementation will enhance the accuracy of sentiment
analysis, support multilingualism, and interactivity, making it a valuable tool for businesses,
researchers, and organisations. The conceptual model of the system defines its key components and
requirements, ensuring the effective implementation of tasks.</p>
        <p>The system analysis made it possible to thoroughly investigate the problems associated with
automating text analysis in social networks and to identify the key aspects of system development.
The subject area was studied in detail, and the general goal of the system was formulated. A tree of
goals was built, which structures the tasks that must be performed to achieve this goal. The
analysis of alternative approaches to building a system made it possible to justify the choice of the
optimal architecture, which provides efficiency, flexibility, and scalability. The key aspects of
system development were: justification of the chosen architecture, modelling the system's structure
and functions, and determining the system's requirements. The system is based on modern
technologies, including transformer models (XLM-RoBERTa) for multilingual sentiment analysis,
Selenium for the automatic collection of dynamic content, and seaborn for visualisation. This
choice enables you to ensure high accuracy of analysis, adapt to the specifics of texts from social
networks, and achieve ease of use. The requirements for performance, accuracy, interactivity and
scalability of the system were formulated. It ensures its ability to process large amounts of data,
work with multilingual content, and adapt to changes in the subject area. The creation of such a
system is a crucial step in addressing the current challenges of analysing texts in social networks. It
enables you to automate data collection, processing, and analysis, increase the accuracy of
sentiment analysis, maintain multilingual support, and provide interactive visualisation of results.
It makes the system useful for businesses, researchers, NGOs, and government agencies that work
with large amounts of text data. Thus, the conducted system analysis laid the groundwork for the
further implementation of a system that addresses the challenges of analysing texts in social
networks and meets the modern requirements for information systems.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Selection of methods and means of the product being developed</title>
      <p>The choice of methods and means is a key stage in the development of information systems, as the
efficiency, accuracy, and productivity of the final product depend on it. For the development of a
system to automate the analysis of comments on social networks, particularly Instagram, it is
essential to consider the specific characteristics of texts, which are often multilingual, unstructured,
and contain slang, emojis, abbreviations, and symbols. It creates additional challenges for data
collection, processing, and analysis. The main tasks that the system solves are:
1. Comment parsing – automatic collection of text data from social networks, including
dynamic content generated by JavaScript.
2. Language Detection– Automatically detect the language group of texts, including support
for mixed languages and texts with symbols.
3. Sentiment analysis – determining the emotional polarity of comments (positive, negative,
neutral), taking into account the context.
4. Trend building – analysing mood changes over time and identifying long-term trends.
5. Interactive visualisation – presenting results in the form of graphs and reports that allow
users to interact with data.</p>
      <p>To address these issues, several key requirements for the methods and technologies employed in
the system have been identified: accuracy, performance, multilingualism, adaptability, and
interactivity. The methods should provide high accuracy of sentiment analysis, especially for local
languages (Ukrainian, Russian) and multilingual content. The system must efficiently process large
amounts of data, including thousands of comments from social networks. The methods should
support the analysis of texts in various languages, including those that are mixed. The technologies
used should take into account the specific characteristics of texts from social networks, such as
slang, emojis, and abbreviations. Visualisation tools should provide ease of use and allow for
interactive graphing. To achieve the set goals, modern technologies and libraries that meet these
requirements were chosen. In particular, Selenium is used for parsing, while transformer models
from the Transformers library are utilised for sentiment analysis. The langdetect library is
employed for language detection, and the Seaborn and Matplotlib libraries are utilised for data
visualisation. Each of these tools has been selected based on its benefits in its respective field,
ensuring the system's efficiency and reliability. Thus, the choice of methods and means is a critical
stage that determines the success of the system implementation. In the following sections, the
choice of each technique and tool will be substantiated in detail, and a comparative analysis with
analogous tools will be conducted to confirm their effectiveness.</p>
      <sec id="sec-4-1">
        <title>4.1. Sentiment analysis methods</title>
        <p>Sentiment analysis is a key task for the system being developed, as it enables the determination of
users' emotional reactions to content on social networks. For this, modern transformer models are
used, which provide high accuracy of analysis due to the context. Advantages of transformer
models include high accuracy, multilingualism, and adaptability. Transformers consider the context
of each word, enabling you to accurately determine the emotional polarity of the text. Models such
as XLM-RoBERTa support text analysis in different languages, which is critical for a multilingual
social media environment. Models can be further trained on specific data from social networks,
which allows slang, abbreviations, and emojis to be taken into account. Comparison with
traditional methods:
1. Dictionary Approaches (VADER) is a simple sentiment analysis tool that is based on word
polarity dictionaries. Erevaga: Ease of use, high accuracy for English-language content.
Disadvantages: low accuracy for multilingual content and a lack of consideration for
context.
2. Transformers offer significantly higher accuracy due to their contextual understanding, but
require more resources to operate.</p>
        <sec id="sec-4-1-1">
          <title>Models used:</title>
          <p>1. A multilingual version of XLM-RoBERTa that supports more than 100 languages, including
Ukrainian, Russian, and English. It is used to analyse the sentiments of comments in
Ukrainian. Model is cardiffnlp/twitter-xlm-roberta-base-sentiment.
2. RuBERT is a model designed for analysing texts in Russian, specifically adapted to the
language's unique characteristics. It is used to analyse the sentiment of comments in
Russian. Model is blanchefort/rubert-base-cased-sentiment.
3. RoBERTa is an English-language model that demonstrates high accuracy for texts from
social networks. It is used to analyse the sentiment of comments in the English language.</p>
          <p>Model is cardiffnlp/twitter-roberta-base-sentiment.
4. RoBERTa for symbols is used to analyse texts consisting only of symbols or emojis. Model
is cardiffnlp/twitter-roberta-base-sentiment.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Methods for determining language</title>
        <p>Determining the language of texts is an essential task for multilingual analysis. The system under
development uses the langdetect library, which provides automatic detection of the language group
of texts. The library uses statistical models to classify texts by language. Advantages: ease of
integration into Python projects, high speed and support for more than 50 languages, including
Ukrainian, Russian and English. Disadvantages include low accuracy for short texts, which are
often found in comments, as well as limited support for mixed languages (for example, when
Ukrainian and English are used in the exact text).</p>
        <p>Comparison with other methods:
1. FastText is a library that demonstrates high accuracy for short texts and supports more
than 170 languages. Advantages: high accuracy, adaptability to texts from social networks.</p>
        <p>Disadvantages: resource requirements, complexity of integration.
2. Langdetect is a straightforward solution for basic language detection, but it is less effective
for complex texts.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Word processing methods</title>
        <p>Pre-processing of texts is a critical step to ensure the accuracy of the analysis. The following
approaches are used in the system being developed:</p>
        <p>I. Pre-treatment:
1. Normalisation of texts: Remove unnecessary characters, spaces, and punctuation; Convert
text to lower case; Using the unicodedata library to remove accents and special characters.
2. Cleaning Texts: Remove URLs, emojis, and other irrelevant items; Using regular
expressions (re) to find and remove specific patterns; Tokenisation is the process of
splitting text into separate words or tokens for further analysis.</p>
        <sec id="sec-4-3-1">
          <title>II. Using regular expressions (re):</title>
          <p>1. Simplicity – regular expressions allow you to process text data efficiently.
3. Flexibility – the ability to adapt to the specifics of texts from social networks.</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Visualization methods</title>
        <p>Data visualisation enables you to present the results of the analysis clearly and understandably.
The following libraries are used in the system under development:
1. Seaborn is a tool for creating aesthetic charts. It is used to build bar graphs, heat maps, and
line graphs.
2. Matplotlib is a basic library for creating graphs. It is used to adjust the details of graphs
(colours, labels, axes).
3. Statsmodels is used to smooth data using the LOWESS (Locally Weighted Scatterplot</p>
        <p>Smoothing) method. Allows you to identify trends in time series.</p>
        <p>Advantages: flexibility (the ability to create both static and interactive graphs), aesthetics
(graphs look modern and clear), and interactivity (integration with Plotly to build interactive
graphs). The methods for solving the problem were chosen taking into account the specifics of
texts from social networks, as well as the requirements for accuracy, productivity, and
multilingualism. The use of transformer models for sentiment analysis, language detection libraries
such as langdetect, regular expressions for text processing, and modern libraries for visualisation
ensures the efficiency and reliability of the system. Each method was chosen based on its
advantages in the relevant field, allowing you to solve tasks with high quality.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Means of solving the problem</title>
        <p>Python was chosen as the primary programming language to automate the analysis of comments
from social networks. Python is a versatile, user-friendly, and powerful tool for working with text
data, analysing large datasets, and developing complex systems. Its popularity in the field of
natural language processing (NLP), the presence of numerous libraries, and an active community
make Python the optimal choice for this project. The rationale for choosing Python is versatility,
simplicity, community, and scalability. Python supports a wide range of libraries for working with
texts, data analysis, and visualisation. Easy to digest and allows you to quickly develop complex
systems – a large user base and an active community that provides access to documentation and
support. Python is well-suited for working with both small datasets and large amounts of
information. Overview of the libraries used:
1. Selenium is used to automatically scrape comments from dynamic Instagram content that
JavaScript generates. Selenium enables you to interact with web pages in a manner similar
to a real user, including clicking buttons, scrolling through pages, and filling out forms.
Advantages: the ability to work with dynamic content and flexibility in setting up parsing
for specific tasks. Disadvantages: high resource requirements compared to other tools.
2. Pandas is used to process and analyse structured data. Pandas allows you to work
conveniently with data tables (DataFrames). Support for filtering, aggregation,
ru, HEROES GLORY
symbols_only,symbols_only,symbols_only,en,No ❤️
one will care as much as He is true
en, Glory to Ukraine
symbols_only,en,A ❤️❤️
uk,Win
en,is with
symbols_only, 
symbols_only, UK,take ❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️
care of all and symbols_only,symbols_only,UK,MPs 
' salaries and then there will be</p>
        <sec id="sec-4-5-1">
          <title>Database view after executing the module Process_data:</title>
          <p>Main_Language,Filtered_Comment
en, I presented state awards Crosses of Military Merit and Orders of the Golden Star to our and handed the
orders to the family members of the Heroes who were posthumously awarded this Everyone who Everyone
who works for defense and truly for the state not just for Everyone who I thank I thank the families of our
warriors for such for such We will undoubtedly endure this And we will certainly ensure a dignified life for
great President to be respected and thank you to the Warriors of
%❤️</p>
          <p>The Analyze_data module uses transformer models that add a sentiment column to the
database, where one of three values (positive, neutral, or negative) is possible (Fig. 8).</p>
          <p>Database view after executing the module Process_data:
Main_Language,Filtered_Comment,Sentiment
en,I presented state awards Crosses of Military Merit and Orders of the Golden Star to our and handed the
orders to the family members of the Heroes who were posthumously awarded this Everyone who Everyone
who works for defense and truly for the state not just for Everyone who I thank I thank the families of our
warriors for such for such We will undoubtedly endure this And we will certainly ensure a dignified life
for,positive
ru,HEROES GLORY,neutral
symbols_only,
symbols_only,  
symbols_only,
,positive
  ,positive
,positive
en,No one will care as much as He is true,neutral
en,Glory to Ukraine,positive
en,A great President to be respected and thankyou to the Warriors of,positive
uk,Win,positive
symbols_only,</p>
          <p>,neutral</p>
          <p>In the Aggregate_data module, all files with posts are numbered and combined into one file (Fig.
9), where the id column indicates the post number, with 1 being the newest and each subsequent
number representing an older post. And the sub_id column means the order in which the comment
is displayed, which is formed by Instagram algorithms based on likes and other indicators.</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>Database view after executing the aggregate_data module:</title>
          <p>id,sub_id,Main_Language,Sentiment,Filtered_Comment1,1,en,positive,I presented state awards
Crosses of Military Merit and Orders of the Golden Star to our and handed the orders to the family
members of the Heroes who were posthumously awarded this Everyone who Everyone who works
for defense and truly for the state not just for Everyone who I thank I thank the families of our
warriors for such for such We will undoubtedly endure this And we will certainly ensure a
life
for1,2,ru,neutral,HEROES
dignified
,No</p>
          <p>️❤
GLORY1,3,symbols_only,positive,1,5,symbols_only,positive,1,6,symbols_only,positive,1,7,en,neutral</p>
          <p>will care as much as He is true1,8,en,positive,Glory to
Ukraine1,9,symbols_only,positive,1 ❤️❤️, 10,en,positive,A great President to be respected and
thankyou
to
the</p>
          <p>Warriors
of1,11,uk,positive,Win1,12,en,neutral,is
with1,13,symbols_only,neutral,1,14,symbols_only,positive,1,15,uk,neutral,save 
❤️❤️❤️❤️❤️❤️❤️❤️❤️❤️all

%❤️
salaries and then it will be1, 19,symbols_only,positive,1,20,en,positive,I 
am continuously so
impressed
by</p>
        </sec>
        <sec id="sec-4-5-3">
          <title>President</title>
          <p>He
seems
to
work
so
tirelessly
for
his
country
 '
and
God1,21,uk,neutral,Eternal1,22,uk,negative,Eternal and Light1,23,uk,neutral,how</p>
          <p>The visualize_data module is the final module, in which the dataset is evaluated, and all the data
within it is visualised.</p>
        </sec>
        <sec id="sec-4-5-4">
          <title>Additionally, the solidarity index is calculated, which represents the percentage of comments that match the author's description of the post. Here are some common characteristics of the dataset, shown in Fig. 10.</title>
          <p>In the following visualisation, you can see the number of comments for each post, where the
first is the most recent, the twentieth is the oldest (Fig. 11a). Next, you can see how many
comments were written by each of the language categories (Fig. 11b).</p>
          <p>In the following images, you can see the tonality distribution for each language, as well as for
each post (Fig. 12). Fig. 13 shows the distribution of comment lengths. Fig. 14 shows how the length
of a comment depends on the key, as well as on the language in which it was written.</p>
          <p>Fig. 15a analysed the longest comments. Fig. 15b presents the solidarity index in tabular form.
Fig. 16a and Fig. 17a show the distribution of moods within the comments of each language. Next,
the solidarity index for each language was analysed, as well as the sentiments expressed in the
comments for each language across all posts (Fig. 16b and Fig. 17b).</p>
          <p>In the following Figs. In Figures 18–20, you can see the characteristics of the distribution of the
solidarity index for each of the languages.</p>
          <p>Graphs in Figs. 21–25 show a change in the positivity of sentiments in the comments for each
language over time. You need to read them from right to left, because 1 is the most recent post and
20 is the oldest. You can see that comments in Russian are very rarely positive, for apparent
reasons (Fig. 22). Fig. 23 shows that comments from symbols are most often positive. The remarks
in Ukrainian have the most stable positivity index, which fluctuates approximately within the
range of 0.2-0.4 (Fig. 24). The positivity graph in Fig. 25, which combines comments from all
languages, shows a rather volatile trend towards change.</p>
          <p>So, following this control example, with the help of existing modules, you can download and
analyse comments on posts from any Instagram page.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>7. Program Execution Statistics</title>
      <p>In modern information systems, the efficiency of software is a key factor in determining its
practical value. Performance analysis enables you to assess how efficiently the system performs
tasks, identify potential bottlenecks, and determine ways to optimise. This section presents a
statistical analysis of the system's implementation for automating comment analysis in social
networks. The main attention is paid to the following indicators:
1. The execution time of the main modules is a measurement of the time required to complete
each stage of data processing (collection, processing, sentiment analysis, aggregation, and
visualisation).
2. Data processing – analysis of the number of processed comments at each stage.
3. Resources – An estimate of the use of computing resources such as RAM and CPU.</p>
      <p>It is necessary to evaluate the system's efficiency, identify its strengths and weaknesses, and
provide recommendations for enhancing performance. The statistics will be presented in the form
of tables, graphs, and charts, allowing you to visualise the results of the analysis and make them
more understandable to users. The results obtained will not only confirm the effectiveness of the
developed system, but also provide valuable information for its further improvement. Let's proceed
to the analysis of the program's main modules. Since selenium and an automatic browser client are
used to load (Load_data) all comments as the basis of the dataset, the downloading process takes a
significant amount of time. After all, to make the process as secure as possible and the script
faulttolerant, various approaches, such as WebDriverWait and Time.sleep(), are integrated into it
(Fig. 26). The combination of these approaches avoids blocking the IP address due to excessive
requests. Also, it ensures the fault tolerance of the script (Fig. 27a). Additionally, in guaranteeing
the script's speed, storing browser cookies after the first authorisation plays a crucial role. It avoids
a delay in logging into your account. Also, the performance of the script is affected by the number
of posts uploaded to the target page, as well as the number of comments in each of them (Fig. 27b).
In our case, when uploading 20 posts with an average of 180 comments in each of them, the
execution time reached 36 minutes.</p>
      <p>The primary function of the Process_data module is to filter comments and determine the
primary language in each of them (Fig. 28). In order to save time and resources of the system
during scaling, a highly efficient Langdetect library was chosen, which made it possible to reduce
the module's execution time to about 8 seconds on a dataset of almost 4000 entries. The primary
function of the Analyze_data module is to utilise transformer models specific to each language to
assess the mood of each comment and add the corresponding designation to the dataset (Fig. 29).
This analysis, which involves processing comments through models, is the most time-consuming
step. The longer the comment, the longer it will take to process. Since all the models were
preloaded and most comments on social media are pretty brief, the execution time was only 4 minutes.</p>
      <p>The Aggregate_data module aggregates all received files into a single dataset using the pandas
library, so it does not require much time. It took less than a second to execute (Fig. 30).</p>
      <p>The Visualize_data module does not manipulate data; its primary function is to visualise an
already created dataset (Fig. 31–32). Those graphs and statistics, which are built for the entire
dataset, do not depend significantly on its size, and in general, each took less than a second. The
only exception is the visualisation of mood trends, as we used the lowess function from the
statsmodels library, which is picky about resources. However, the section for analysing individual
posts took significantly longer – 6.2 seconds - and the time of its execution is directly proportional
to the number of posts and comments within them.</p>
      <p>The analysis of the system's implementation statistics enabled an assessment of its effectiveness,
performance, and compliance with the assigned tasks. The results confirmed that the system
successfully handles the functions of collecting, processing, analysing, and visualising large
amounts of text data. Key takeaways:
1. Module execution time. The stage of mood analysis using transformer models takes the
most time, which is due to the high computational requirements of these models. Other
modules, such as data collection, word processing, and aggregation, are relatively fast due
to the use of efficient algorithms and streamlined approaches.
2. System performance. The system is capable of processing large amounts of data,
maintaining high accuracy at every stage. The use of multilingual transformer models
provides a qualitative analysis of moods for Ukrainian, Russian and English.
3. Use of resources. The main load falls on the stage of mood analysis, where transformer
models are used. It requires significant computing resources, including RAM and CPU time.
Other stages, such as text scraping, filtering, and data aggregation, have low resource
requirements.
4. Analysis of processed data. The system demonstrates high efficiency in working with texts
of varying lengths and complexities, including comments with symbols, slang, and mixed
languages. Visualising the results makes it easy to interpret the data obtained, which is
essential for end users.</p>
      <p>Recommendations for improvement:
1. Optimisation of transformer models – use of less resource-intensive models (e.g.</p>
      <p>DistilBERT) for texts with low complexity, as well as integration of methods of preliminary
classification of texts to determine which comments require detailed analysis.
2. Data collection optimisation – using Instagram's official API (subject to availability) to
reduce comment collection time and avoid possible restrictions from the platform.
3. System scaling – the implementation of multithreading or distributed computing for
parallel processing of large amounts of data, as well as the use of cloud services for the
analysis of large data sets.</p>
      <p>The developed system demonstrates high efficiency and productivity in solving the problems of
analysing comments in social networks. It meets modern requirements for automating the
processing of large amounts of text data, providing accurate sentiment analysis, multilingual
capabilities, and convenient visualisation of results. However, further optimisation of computing
resources and integration with other platforms will make the system even more versatile and
productive.</p>
    </sec>
    <sec id="sec-6">
      <title>8. Comprehensive</title>
      <p>accounts
analysis
of
different categories
of Instagram
As an additional task, a new dataset was created, which included more than 20000 comments. Let's
conduct a comprehensive analysis of various categories of Instagram accounts. The first category
we will consider is the business page (Fig. 33). This dataset includes 1600 comments and three
different pages (Fig. 34).</p>
      <p>Therefore, even Ukrainian business accounts primarily focus on the English language. As
expected, according to Fig. 35, symbolic comments contain the most positive dynamics, while in
other languages, neutral comments prevail. On average, comments for business accounts are not
very long, but they are nevertheless longer than those for personal pages (Fig. 36 –37). In these
images, you can see that the comments in business accounts for the most part coincide in tone with
the description from the author of the page.</p>
      <p>Let's proceed to the analysis of the positivity of comments (Figs. 38-41). You can see that most
comments in this topic are neutral or positive, with a clear trend towards improving the sentiment
of comments over time.</p>
      <p>The entertainment category contains 1766 comments from 3 different pages (Fig. 42). In
category discussions, most of the comments turned out to be neutral (Fig. 43a), even in the
symbol_only category. Comments turned out to be significantly longer than those in the category
of business pages (Fig. 43b). The solidarity index was, on average, the same as that of business
pages, but with a different distribution by language (Fig. 44–45).</p>
      <p>Let's proceed to the analysis of moods (Figs. 46–50). You can see that pages from the
entertainment category have a significantly lower number of positive comments compared to
business accounts. However, a positive trend is also evident over time.</p>
      <p>The last and largest analysed category is political accounts (Fig. 51). Three accounts, with a total
of 17,000, were analysed. As in the category of business accounts, the largest categories of mood in
languages are symbolic and English positive comments (Fig. 52a). It can be seen that in the sample
of political commentaries there were significantly longer comments than in the previous categories
(Fig. 52b). The solidarity index for political posts was considerably lower than for other categories
(Fig. 53). The graph also shows that most of the solidarity comments are symbolic (Fig. 54).</p>
      <p>Let's move on to sentiment analysis (Fig. 55–59). You can see that here, unlike other categories,
there are clear peaks and troughs on the chart, indicating a high correlation between political
events and the mood of comments. It is also worth noting that for comment schedule in Ukrainian
and all languages, there is a strong trend towards improving the comments mood over time.</p>
    </sec>
    <sec id="sec-7">
      <title>9. Conclusions</title>
      <p>As a result of the study, an integrated approach to the automated analysis of Instagram comments
was developed, which considers the multilingual nature of content, the dynamism of social
networks, and the characteristics of the informal online communication environment. The
proposed system provides a comprehensive cycle of data processing, encompassing automatic
parsing of comments, language detection, sentiment analysis, time trend analysis, and interactive
visualisation of results. Based on a critical analysis of existing solutions, it was found that most
commercial and open source tools demonstrate low accuracy for local languages, in particular
Ukrainian, work with short and mixed texts to a limited extent, and do not take into account the
specifics of social networks (emojis, slang, symbols). It confirmed the need to create a specialised
system capable of adapting to the real-world conditions of data processing on social platforms.</p>
      <p>As part of the study, a method for multilingual analysis of Instagram comments using
specialised transformer models, specifically XLM-RoBERTa and individual models for specific
language groups, was proposed for the first time. It enabled the achievement of high accuracy in
determining moods for Ukrainian, Russian, and English, as well as in processing content consisting
only of symbols or emojis. The developed algorithms for identifying the language group and
assessing the solidarity of comments with posts provide a deeper contextual analysis of user
reactions. An additional scientific result is a technique for analysing mood dynamics over time,
which combines time series models, moving averages, and clustering. It enables you to identify
changes in the audience's emotional response, predict trends, and assess long-term engagement
with content. Visualising results using interactive tools enhances the practical value of the system
for businesses, researchers, analysts, and organisations. The created system is relevant and
significant for Ukraine, as it contributes to the development of local NLP solutions, supports the
analysis of Ukrainian-language content, enables the monitoring of public sentiment, counters
disinformation, and provides new opportunities for business intelligence.</p>
      <p>Thus, the developed system demonstrates high efficiency in the multilingual analysis of texts
from social networks, opening up prospects for further improvement, including expanding
language support, integrating new transformer models, and enhancing the accuracy of analysing
emotional and semantic characteristics of comments.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The authors have not employed any Generative AI tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nazarkevych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vladov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chyrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lozynska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavrut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Budz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Muzychuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nagachevska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Diakun</surname>
          </string-name>
          ,
          <article-title>Information technology for promoting Instagram accounts</article-title>
          ,
          <source>in: Proceedings of the Computational Intelligence Application Workshop</source>
          , CIAW '
          <year>2024</year>
          , CEUR Workshop Proceedings, Aachen, Germany,
          <year>2024</year>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>266</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Starchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chyrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ushenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Uhryn</surname>
          </string-name>
          ,
          <article-title>Sentiment analysing and visualising public opinion on political figures across YouTube and Twitter using NLP and machine learning</article-title>
          ,
          <source>IJIGSP</source>
          <volume>17</volume>
          (
          <issue>5</issue>
          ) (
          <year>2025</year>
          )
          <fpage>117</fpage>
          -
          <lpage>164</lpage>
          . doi:
          <volume>10</volume>
          .5815/ijigsp.
          <year>2025</year>
          .
          <volume>05</volume>
          .08.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Uhryn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chyrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chyrun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ushenko</surname>
          </string-name>
          ,
          <article-title>Intelligent application for textual content authorship identification based on machine learning and sentiment analysis</article-title>
          ,
          <source>IJISA</source>
          <volume>17</volume>
          (
          <issue>2</issue>
          ) (
          <year>2025</year>
          )
          <fpage>56</fpage>
          -
          <lpage>100</lpage>
          . doi:
          <volume>10</volume>
          .5815/ijisa.
          <year>2025</year>
          .
          <volume>02</volume>
          .05.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Holubinka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vladov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ushenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Talakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tomka</surname>
          </string-name>
          ,
          <article-title>Intelligent system for recognizing tone and categorizing text in media news at an electronic business based on sentiment and sarcasm analysis</article-title>
          ,
          <source>IJIEEB</source>
          <volume>17</volume>
          (
          <issue>1</issue>
          ) (
          <year>2025</year>
          )
          <fpage>90</fpage>
          -
          <lpage>139</lpage>
          . doi:
          <volume>10</volume>
          .5815/ijieeb.
          <year>2025</year>
          .
          <volume>01</volume>
          .06.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. N. Wa</given-names>
            <surname>Nkongolo</surname>
          </string-name>
          ,
          <article-title>News classification and categorization with smart function sentiment analysis</article-title>
          ,
          <source>Int. J. Intell. Syst</source>
          .
          <year>2023</year>
          (
          <article-title>1) (</article-title>
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1155/
          <year>2023</year>
          /1784394.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <article-title>Leveraging textual information for social media news categorization and sentiment analysis</article-title>
          ,
          <source>PLOS ONE 19 (7)</source>
          (
          <year>2024</year>
          ). doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0307027</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>A pragmatic and intelligent model for sarcasm detection in social media text</article-title>
          ,
          <source>Technol. Soc</source>
          .
          <volume>64</volume>
          (
          <year>2021</year>
          ). doi:
          <volume>10</volume>
          .1016/j.techsoc.
          <year>2020</year>
          .
          <volume>101489</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>U.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. W.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Srivastava, Emotional intelligence attention unsupervised learning using lexicon analysis for irony-based advertising</article-title>
          ,
          <source>TALLIP</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>19</lpage>
          . doi:
          <volume>10</volume>
          .1145/3580496.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Iddrisu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mensah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Boafo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. R.</given-names>
            <surname>Yeluripati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kudjo</surname>
          </string-name>
          ,
          <article-title>A sentiment analysis framework to classify instances of sarcastic sentiments within the aviation sector</article-title>
          ,
          <source>Int. J. Inf. Manag. Data Insights</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ) (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1016/j.jjimei.
          <year>2023</year>
          .
          <volume>100180</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>A. D. Yacoub</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Slim</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Aboutabl</surname>
          </string-name>
          ,
          <article-title>A survey of sentiment analysis and sarcasm detection: Challenges, techniques, and trends</article-title>
          ,
          <source>IJECES</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ) (
          <year>2024</year>
          )
          <fpage>69</fpage>
          -
          <lpage>78</lpage>
          . doi:
          <volume>10</volume>
          .32985/ijeces.15.
          <issue>1</issue>
          .7.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Ahmed, Classification, detection and sentiment analysis using machine learning over next generation communication platforms</article-title>
          ,
          <source>Microprocessors and Microsystems</source>
          <volume>98</volume>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1016/j.micpro.
          <year>2023</year>
          .
          <volume>104795</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C. I.</given-names>
            <surname>Eke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Norman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shuib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Nweke</surname>
          </string-name>
          ,
          <article-title>Sarcasm identification in textual data: Systematic review, research challenges and open directions</article-title>
          ,
          <source>Artif. Intell. Rev</source>
          .
          <volume>53</volume>
          (
          <issue>6</issue>
          ) (
          <year>2020</year>
          )
          <fpage>4215</fpage>
          -
          <lpage>4258</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10462-019-09791-8.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mansoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tahat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. Al</given-names>
            <surname>Zoubi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Tahat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Habes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Himdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Salloum</surname>
          </string-name>
          ,
          <article-title>Detection of sarcasm in news headlines using NLP and machine learning</article-title>
          , in: A.
          <string-name>
            <surname>Al-Marzouqi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Salloum</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Shaalan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gaber</surname>
          </string-name>
          , R. Masa'deh (Eds.),
          <source>Generative AI in Creative Industries</source>
          , Springer, Cham, Switzerland,
          <year>2025</year>
          , pp.
          <fpage>503</fpage>
          -
          <lpage>517</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>031</fpage>
          -89175-5_
          <fpage>31</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pachauri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Alhussan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Abdallah</surname>
          </string-name>
          ,
          <article-title>Sarcasm detection over social media platforms using hybrid ensemble model with fuzzy logic</article-title>
          ,
          <source>Electronics</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ) (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .3390/electronics12040937.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. R. W.</given-names>
            <surname>Sait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Ishak</surname>
          </string-name>
          ,
          <article-title>Deep learning with natural language processing enabled sentimental analysis on sarcasm classification</article-title>
          ,
          <source>Computer Systems Scienceand Engineering</source>
          <volume>44</volume>
          (
          <issue>3</issue>
          ) (
          <year>2023</year>
          )
          <fpage>2553</fpage>
          -
          <lpage>2567</lpage>
          . doi:
          <volume>10</volume>
          .32604/csse.
          <year>2023</year>
          .
          <volume>029603</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>