<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intelligent Radar for Aragonese Tourism</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rosa M. Montañés-Salas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paula Peña-Larena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María del Carmen Rodríguez-Hernández</string-name>
          <email>rdelhoyo@itainnova.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis García-Garcés</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Pérez-Benedí</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Mayo-Macías</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Meléndez-Estrada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rafael del-Hoyo-Alonso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Luis Galar-Gimeno</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aragon Institute of Technology (ITAINNOVA), María de Luna</institution>
          ,
          <addr-line>7-8, Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tourism of Aragon</institution>
          ,
          <addr-line>Avda. Ranillas, 3A, 3rd floor, Office 3D, Zaragoza</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the background and architecture of the Intelligent Radar for Aragonese Tourism (RITA), a data-driven social media surveillance system that aims to enhance the tourist experience by helping the Aragonese government make informed, data-driven decisions, thereby empowering its tourism sector. The system is built on a customizable platform that integrates multiple data mining techniques to collect, clean, process and extract explicit and implicit knowledge from various sources such as social media networks, web pages, RSS feeds and structured data files. RITA employs state-of-the-art Natural Language Processing technologies combined with data analysis and modelling techniques to analyse the social perception of the region and link that information with organizational data. The platform integrates pre-trained and fine-tuned language models based on transformer architectures for solving different NLP tasks, including opinion and emotion analysis, semantic classification and entity recognition. The knowledge gathered is made available to tourism professionals via an interactive and customizable web application.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
        <kwd>Social Media</kwd>
        <kwd>Tourism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Social media networks are an integral part of daily human life; they have positioned themselves as one of the most popular forms of communication, entertainment and social connection [1]. People generate and share almost any type of content regarding their opinions, interests, experiences and desires, which constitutes a huge and invaluable source of data and knowledge about human behaviour. The tourism domain may benefit greatly from the amount of information shared publicly among citizens if it is provided with appropriate data intelligence tools to discover at first hand and analyse the tastes, inclinations and concerns of tourists and other interested groups of people ([2], [3], [4]).</p>
      <p>Multiple Natural Language Processing (NLP) challenges, such as multilingualism, the use of formal and informal expressions, ambiguity and others, must be faced when using social media data for human-related research. These challenges, coupled with the multimodal nature of user-generated data, may be better addressed by combining cutting-edge techniques with conventional methods. In this context, ITAINNOVA has re-designed and evolved the solution showcased in [5], which was conceived as a unified self-monitoring system for a particular user, a place where any individual could stay updated about their virtual social environment by thoroughly analysing the virtual interactions and extracting implicit knowledge from them. The capability to extract and organize valuable information from social networks is primarily enabled by the use of various natural language processing techniques, such as semantic categorization, entity extraction and opinion inference. Moving the focus from a single user to a more professional setting makes the system applicable as a working tool for any public or private corporation to empower its decision-making processes. The main objective pursued is to develop a multimodal data-driven social intelligent platform.</p>
      <p>Both private companies and public administrations are recognizing the imperative to integrate digitalization and artificial intelligence (AI) tools into their organizational processes. Given the significance of the tourism sector in Aragón and the desire to enhance its robustness, as well as to gain profound insights into its users and potential visitors, Aragonese Tourism embarked on a collaborative endeavour with ITAINNOVA to develop a decision support tool based on the data-driven social platform. The primary objective of this system is to aggregate citizens' information from diverse sources and modalities. While surveys and traditional statistical analysis offer precise and valuable data, it is essential to acknowledge that individuals convey a wealth of information through spoken and written language, especially through open social media platforms, which requires incorporating advanced AI and NLP techniques into the decision-making process [6].</p>
      <p>RITA emerged from the need to better understand the
needs and expectations of the tourism sector in Aragon.</p>
      <p>In this paper, the social media monitoring system for Aragonese tourism is presented with the following structure: after the introduction, section 2 outlines the general platform designed. Section 3 presents the final system developed on top of the social media platform. The last section concludes with an overview of the main outcomes achieved, as well as suggestions for further improvements.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Social media platform</title>
      <p>ITAINNOVA has designed a highly scalable and customizable system, named Social Media platform, that aims to fit the needs of current social and marketing research. The long-term objective is for it to serve as a base architecture that integrates heterogeneous data sources, which are curated, fused and modelled by means of advanced artificial intelligence techniques, deploying a wide range of smart services and enabling the development of data resources, including datasets and in-domain knowledge models (large language models, multimodal models and data models). The operational architecture is shown in figure 1.</p>
      <p>Figure 1: Operational architecture.</p>
      <p>This platform is designed with the vision of serving as a decision support tool and as a domain knowledge repository. The core of the system corresponds to the block identified as Social-Media, together with all the data and model repositories displayed in the lower part. Social-Media implements and exposes a series of microservices, which are consumed by the client module (“Client API”), responsible for adapting and executing the customized processes for the specific domain.</p>
      <p>As a general guideline, the platform is designed around the main pillars of data mining: gather useful information; process, clean and obtain relevant data; analyse and model the explicit and implicit knowledge contained in the data; and disseminate results. An overview of the implementation of these steps is presented in this section.</p>
      <sec id="sec-2-1">
        <title>2.1. Information retrieval</title>
        <p>The first step in the development is to properly establish the focus of the information to be retrieved and analysed, i.e. to determine which information sources will be included in the system, the specific domain of the information to be retrieved, the data types supported and the feasibility of their inclusion.</p>
        <p>The current state of development of the platform supports the intake of textual and numerical data from diverse sources: social media networks, web pages, RSS feeds and structured data files. The primary social network integrated is Twitter, which exposes a public API composed of a set of endpoints that allow exploring and managing Twitter entities, depending on the access level authorized. A monitoring system will typically require tracking the most recent events in the domain under research; thus, by default, standard endpoints are queried.</p>
        <p>In order to store all this information, a homogenized data model has been designed on a document-oriented database engine. This model is composed of two different entities: the first one, called “social-networks”, structures and stores the information on publications from the sources described above (data and metadata), homogenizing the differential concepts of these sources; the second one, called “users”, links the public information of the authors of the publications.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. No-NLP: Not only NLP</title>
        <p>The following phases of data cleaning, processing and modelling are crucial in order to obtain useful information, implicit knowledge, patterns and trends from large amounts of data. At this stage, two main approaches are considered: applying different state-of-the-art algorithms to the core data, i.e. the text interactions, and then exploiting relations with the rest of the analytical data to deduce higher-level associations. Therefore, primarily transformer-based NLP techniques have been applied to analyse textual content [7], identify salient elements and infer information at the semantic level, as described below.</p>
        <p>After the extraction of social media content, a pre-filtering step is performed based on the keywords configured for searching and the metadata retrieved, applying specific domain rules. A data processing pipeline is then applied to the most relevant candidates obtained from this filtering process in order to extract structured and meaningful information, as depicted in figure 2.</p>
        <p>Figure 2: Information processing.</p>
        <p>Opinion analysis tries to identify and extract subjective information expressed in human linguistic production (spoken and written), that is, it tries to determine the attitude of the interlocutor with respect to the general topic or particular aspects of a message. This attitude can be classified in different ways; in this case a three-level categorical evaluation has been established: positive, negative and neutral. A multilingual XLM-RoBERTa model fine-tuned for Twitter sentiment analysis [8] has been integrated in the system for this submodule.</p>
        <p>Emotion analysis: alongside the previous analysis, it is feasible to perform an analysis of the emotions and feelings expressed about a topic in written and verbal communication [9]. In this case, the model has to distinguish more precise sentiments and assign one or several corresponding tags to the documents. The tags selected are: like, love, haha, wow, sad and angry, inspired by Facebook’s reaction options. A Spanish multilabel dataset, consisting of more than two thousand documents retrieved from social networks in the past, was used to fine-tune the multilingual BERT [10] model developed in this module.</p>
        <p>Named-Entity Recognition (NER) consists of automatically extracting from the textual sequence a set of terms of interest referring to a given concept (entity). Depending on the working domain, the entities defined may vary, but from a general perspective, a standard NER on places, organizations and persons is applied. An already fine-tuned multilingual BERT entity recognizer is integrated in the system. In this task, both the sentence’s grammar and lexical ambiguity must be taken into account in order to get valid results. To overcome possible mistakes produced at the lexical level, the well-established approach based on gazetteers is still applied at the end of the NER processing, making use of the internal and external domain knowledge available.</p>
        <p>Semantic classification refers to the task of assigning one or more predefined categories to a statement according to its overall meaning and the subject to which it refers. This classification makes it possible to group documents and organize large amounts of information according to their semantic interpretation. In the first user-oriented version of the monitoring system, ontology-based strategies were explored and implemented. They allow collecting terms and concepts from specific domains in a structured way that can be explored during text analysis by means of different algorithms, enabling the assignment of a conceptual category to a document or the profiling of users [11]. The results obtained were quite reasonable, although this approach required a periodic update of the knowledge bases used (ontologies, dictionaries, thesauri, lists, etc.) as the data sources change over time. To improve this process, the power of Transformers, which exploit the concept of transfer learning to the maximum [12], was studied and tested, finally including a multilingual zero-shot classifier fine-tuned from the XLM-RoBERTa-Large pre-trained model [13] on a combination of data from 15 languages, such as English, French, Spanish, German, Hindi, etc.</p>
        <p>Geocoding: the functionalities of the platform include a geolocation service that allows finding the information associated with a location, expressed through its coordinates or through its usual name (e.g. Sos del Rey Católico). This component is implemented through queries to the open source service Nominatim, which makes use of OpenStreetMap data. The names of the locations to be geo-positioned are retrieved from the mentions in textual publications by invoking the NER service.</p>
        <p>Further data analysis is applied at the end of the data processing pipeline. Statistical analysis of numerical data extracted from the social networks is performed by computing standard metrics including means, deviations, comparisons among different variables, sample sizes, text frequencies, common patterns and so forth. Additionally, these techniques are applied to different cross-data sets as well, for example the set obtained by joining the number of documents that speak about a topic with the opinion the user is expressing about that topic.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data exploitation</title>
        <p>The results produced during the previous stages should be analysed and shared with the interested parties. The Social-Media platform is designed to be highly flexible, exposing a set of web services through a simple REST API that can be invoked from any compatible client, which makes it possible to integrate the results seamlessly into a web application, external business intelligence systems or client reports, depending on the use case defined by the user.</p>
      </sec>
    </sec>
    <sec id="sec-rita">
      <title>3. RITA</title>
      <p>The acronym RITA comes from its Spanish initials: Radar Inteligente de Turismo de Aragón (Intelligent Radar for Aragonese Tourism). It stems from the collaboration between the Aragon Institute of Technology (ITAINNOVA) and the Society for the Promotion and Management of Aragonese Tourism (the Aragonese Tourism organization from now on). Both entities are cooperating on researching and executing new innovative tourism strategies based on digitalization for the development and improvement of the tourist sector in their region. Listening to social needs and tastes directly and in a non-intrusive way, letting people express themselves freely and comfortably from any place, can only be achieved by consulting public intercommunication spaces such as the Internet and social media networks. Fusing that information with data retrieved from tourism offices, visiting reports from touristic places and open data knowledge would lead experts to thoroughly comprehend which aspects influence the current touristic panorama and which ones should be attended to in order to design and offer new tourist experiences. With this aim, the Aragonese Tourism corporation and ITAINNOVA began the design and development of their intelligent radar.</p>
      <sec id="sec-rita-1">
        <title>3.1. Tourism domain</title>
        <p>For the use case and at the current phase, two information sources have been considered: social networks (Twitter) and web pages (blog posts, RSS feeds).</p>
        <p>Pre-filtering and post-filtering stages are adapted to this particular domain: content related to Aragonese towns, regions and touristic places. In this sense, metadata is examined in order to check whether the query has returned mistaken publications and to discard data in that case. Moreover, the NER locations identified and geolocated are matched against the searches configured, filtering out those that appear clearly out of scope.</p>
        <p>Opinion and emotion analysis are maintained as explained in section 2, while the NER results are curated in order to extract more precise results. The entities extracted for each class are automatically reviewed by applying some in-domain rules devised by tourism specialists: places identified are geolocated and matched with information on villages and regions; likewise, the people names retrieved are examined in detail, as it was found that a number of surnames match location names in Spanish; moreover, all the entities are filtered in terms of character length and social-network-related stopwords in order to discard meaningless information.</p>
        <p>Semantic classification is particularly customized for the use case. Two different criteria are considered to structure the content based on its meaning: products and profiles. The group of products identifies the typology of the tourist offer mentioned; they refer to a combination of places, festivities, activities, natural resources, material and immaterial attractions, etc. The following products are defined within RITA: active tourism, rural tourism, popular festivities, culture and gastronomy. Profiles refer to the characteristics of interest that reflect the attitudes and aspects that tourists demand when visiting different places in the autonomous community. The technical experts from the Aragonese Tourism corporation have selected these labels: safety, comfort, treasury and quality. The zero-shot classifier fits these classification needs with ease since, to our knowledge, there was no public multilingual or Spanish dataset available to fine-tune a language model classifier, and the labels considered may vary depending on the context.</p>
      </sec>
      <sec id="sec-rita-2">
        <title>3.2. Web interface</title>
        <p>A web application has been designed and published, which includes a dashboard displaying all the processed information and different interactive components that allow filtering the results to visualize the desired contents. It includes a timeline of publications that can be sorted by date, relevance, number of “likes” and number of retweets. On one side there is a set of possible filters to be applied: dates, sources, opinion, categories, profiles and tags (regions and municipalities searched in social networks), as depicted in figure 3. On the other side, different statistics of the users who have published are displayed, along with several graphs that show the average opinion of the publications, the evaluation of the aspects considered from the tourist point of view (profile), the most used hashtags, the identified places where the publications have been made and which places have been mentioned, as well as the organizations and people detected in the texts in the form of word clouds (see figure 4).</p>
        <p>The application is intended for technical and management staff of the Aragonese Tourism corporation, to be incorporated as a dynamic consulting tool with which they can explore news from their action area, generate periodical reports, obtain an overall vision of the typology and expectations of potential visitors and, ultimately, get real snapshots of the current situation of the sector in Aragon from more personal perspectives.</p>
      </sec>
    </sec>
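    <p>As a rough illustration of the timeline behaviour described for the web interface, the following Python sketch filters a list of publications by opinion and tag and sorts them by a chosen criterion. The Publication record, its field names, the filter_and_sort helper and the sample data are hypothetical; they only mirror the filters and sort criteria named in the text and are not taken from the RITA implementation.</p>
    <preformat>
```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Publication:
    # Hypothetical record; the field names are assumptions, not RITA's data model.
    text: str
    published: date
    source: str
    opinion: str  # "positive", "negative" or "neutral"
    likes: int = 0
    retweets: int = 0
    tags: list = field(default_factory=list)


def filter_and_sort(publications, opinion=None, tag=None, sort_by="published"):
    """Keep publications matching the optional filters, sorted in descending order."""
    selected = [
        p for p in publications
        if (opinion is None or p.opinion == opinion)
        and (tag is None or tag in p.tags)
    ]
    # sort_by is an attribute name such as "published", "likes" or "retweets".
    return sorted(selected, key=lambda p: getattr(p, sort_by), reverse=True)


# Invented sample data, for illustration only.
SAMPLE = [
    Publication("Great hike in Ordesa", date(2023, 5, 1), "twitter", "positive", 12, 3, ["Sobrarbe"]),
    Publication("Long queue at the museum", date(2023, 5, 3), "twitter", "negative", 2, 0, ["Zaragoza"]),
    Publication("Lovely food in Zaragoza", date(2023, 5, 2), "blog", "positive", 30, 8, ["Zaragoza"]),
]

if __name__ == "__main__":
    for p in filter_and_sort(SAMPLE, opinion="positive", sort_by="likes"):
        print(p.published, p.likes, p.text)
```
    </preformat>
    <p>In a deployed dashboard these filters would more likely be pushed down to the database query rather than applied in memory; the in-memory version is used here only to keep the sketch self-contained.</p>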
    <sec id="sec-3">
      <title>4. Conclusions and future work</title>
      <sec id="sec-3-1">
        <p>The Intelligent Radar for Aragonese Tourism, RITA, has been introduced in this paper. It is built on top of the Social-Media platform designed by ITAINNOVA, whose aim is to evolve into a multimodal data-driven social intelligent platform used to support and enhance any kind of advanced decision-making tool. The radar emerges as a digital solution to empower the tourist sector in Aragon, with the objectives of monitoring the social perception of the region and its touristic attractions, and linking that information with organizational data by applying natural language processing and other state-of-the-art techniques to detect the current situation of the sector and enhance the tourist experience. It can be stated that the RITA solution enables the Aragonese government to make informed decisions based on data, which reduces the cost of data acquisition. The platform is capable of gathering valuable information from heterogeneous sources, focused on user-generated content, which provides a more comprehensive understanding of the tourism sector in Aragon. This data-driven approach empowers the government to make smart decisions that can improve the tourism experience and eventually promote the region’s growth.</p>
        <p>These objectives are fulfilled by the use of a wide range of language technologies combined with data analysis and machine learning techniques. The platform integrates several pre-trained and fine-tuned language models mainly based on transformer architectures for solving different NLP tasks, such as opinion analysis, emotion analysis, named-entity recognition and semantic classification. It also relies on in-domain knowledge in the form of rule-based filters and gazetteers, to improve the insights retrieved from the machine learning processes.</p>
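        <p>The role of rule-based filters and gazetteers can be pictured with a small Python sketch. It post-processes (text, label) pairs as they might come out of an NER model: very short entities and social network stopwords are dropped, and a person entity whose text appears in a place gazetteer is relabelled as a location. The gazetteer entries, the stopword list, the label names (PER, LOC, ORG), the minimum length and the relabelling rule itself are assumptions chosen for illustration, not the rules devised for RITA.</p>
        <preformat>
```python
# Illustrative values only; not the rules or resources used in RITA.
PLACE_GAZETTEER = {"teruel", "huesca", "jaca", "alquézar"}
SOCIAL_STOPWORDS = {"rt", "via", "follow", "dm"}
MIN_LENGTH = 3


def curate_entities(entities):
    """Filter and relabel (text, label) pairs produced by an NER model."""
    curated = []
    for text, label in entities:
        cleaned = text.strip()
        normalized = cleaned.lower()
        # Rule 1: discard entities that are too short to be meaningful.
        if MIN_LENGTH > len(normalized):
            continue
        # Rule 2: discard social network jargon.
        if normalized in SOCIAL_STOPWORDS:
            continue
        # Rule 3 (assumed): a "person" that is a known place name becomes a location.
        if label == "PER" and normalized in PLACE_GAZETTEER:
            label = "LOC"
        curated.append((cleaned, label))
    return curated


if __name__ == "__main__":
    raw = [("Teruel", "PER"), ("via", "ORG"), ("Jaca", "LOC"), ("AB", "ORG"), ("Goya", "PER")]
    print(curate_entities(raw))
```
        </preformat>
        <p>Applying the rules after the model, rather than inside it, keeps the domain knowledge editable by non-programmers: changing a gazetteer or a stopword list does not require retraining.</p>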
        <p>Combining different data types and statistical analysis allows capturing implicit relations in the information that feeds the system.</p>
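        <p>One such cross-data relation, the number of documents about a topic joined with the opinion expressed about it, could be computed along the following lines. This is a minimal sketch: the document dictionaries, the opinion_by_topic helper and the numeric encoding of the three opinion levels are assumptions made for the example.</p>
        <preformat>
```python
from collections import defaultdict
from statistics import mean

# Assumed numeric encoding of the three opinion levels.
OPINION_SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def opinion_by_topic(documents):
    """Return {topic: (document_count, mean_opinion_score)}."""
    scores = defaultdict(list)
    for doc in documents:
        scores[doc["topic"]].append(OPINION_SCORE[doc["opinion"]])
    return {topic: (len(values), mean(values)) for topic, values in scores.items()}


if __name__ == "__main__":
    # Invented sample documents, for illustration only.
    docs = [
        {"topic": "gastronomy", "opinion": "positive"},
        {"topic": "gastronomy", "opinion": "neutral"},
        {"topic": "rural tourism", "opinion": "negative"},
    ]
    for topic, (count, score) in opinion_by_topic(docs).items():
        print(topic, count, score)
```
        </preformat>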
        <p>Nonetheless, there is still room for improvement and some lines of future work are being considered. Extending the number and modalities of data sources is the main target defined in the platform roadmap. Adding external and internal data would lay a foundation for incorporating advanced multimodal transformer models to enrich hidden pattern discovery and for gathering more data which, adequately analysed and modelled, will answer more complex questions for the tourism technical experts.</p>
        <p>Some types of data under evaluation are public customer reviews, publicly accessible reports from the National Statistics Institute and other regional tourist activity reports. With the emergence of generative large language models [14], such as GPT, a new landscape opens up in data analysis and knowledge inference, which will be leveraged in following updates of the platform. On the end-user side, the web interface will add new specific dashboards to analyse more easily particular aspects such as the opinions and emotions associated with a certain region or product, temporal statistics and trends, or social interactions.</p>
        <sec id="sec-3-1-1">
          <title>Acknowledgments</title>
          <p>This work has been partially funded by the Department of Big Data and Cognitive Systems at the Technological Institute of Aragon, by IODIDE group of the Government of Aragon, grant number T1720R and by the European Regional Development Fund (ERDF).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>References</title>
          <p>[1] E. Ortiz-Ospina, M. Roser, The rise of social media, Our world in data (2023).</p>
          <p>[2] A. Huertas Herrera, M. D. Toro-Manríquez, R. Soler Esteban, C. Lorenzo, M. V. Lencinas, G. Martínez Pastur, Social media reveal visitors’ interest in flora and fauna species of a forest region, Ecosystems and People 19 (2023) 2155248.</p>
          <p>[3] R. Nunkoo, D. Gursoy, Y. K. Dwivedi, Effects of social media on residents’ attitudes to tourism: Conceptual framework and research propositions, Journal of Sustainable Tourism 31 (2023) 350–366.</p>
          <p>[4] F. J. Lacarcel, R. Huete, Digital communication strategies used by private companies, entrepreneurs, and public entities to attract long-stay tourists: a review, International Entrepreneurship and Management Journal (2023) 1–18.</p>
          <p>[5] R. Montanés, R. Aznar, S. Nogueras, P. Segura, R. Langarita, E. Meléndez, P. Pena, R. Del Hoyo, Monitorización de social media, Procesamiento del Lenguaje Natural 61 (2018) 177–180.</p>
          <p>[6] N. A. Alghamdi, H. H. Al-Baity, Augmented analytics driven by ai: A digital transformation beyond business intelligence, Sensors 22 (2022) 8071.</p>
          <p>[7] L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, O’Reilly Media, Inc., 2022.</p>
          <p>[8] F. Barbieri, L. E. Anke, J. Camacho-Collados, XLM-T: A multilingual language model toolkit for twitter, CoRR abs/2104.12250 (2021). URL: https://arxiv.org/abs/2104.12250. arXiv:2104.12250.</p>
          <p>[9] F. A. Acheampong, H. Nunoo-Mensah, W. Chen, Transformer models for text-based emotion detection: a review of bert-based approaches, Artificial Intelligence Review (2021) 1–41.</p>
          <p>[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</p>
          <p>[11] P. Peña, R. Del Hoyo, J. Vea-Murguía, C. González, S. Mayo, Collective knowledge ontology user profiling for twitter–automatic user profiling, in: 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 1, IEEE, 2013, pp. 439–444.</p>
          <p>[12] R. Qasim, W. H. Bangyal, M. A. Alqarni, A. Ali Almazroi, et al., A fine-tuned bert-based transfer learning approach for text classification, Journal of healthcare engineering 2022 (2022).</p>
          <p>[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.</p>
          <p>[14] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>