
Intelligent Radar for Aragonese Tourism

Rosa M. Montañés-Salas

Paula Peña-Larena

María del Carmen Rodríguez-Hernández

rdelhoyo@itainnova.es

Luis García-Garcés

Pablo Pérez-Benedí

Sergio Mayo-Macías

Enrique Meléndez-Estrada

Rafael del-Hoyo-Alonso

José Luis Galar-Gimeno

Aragon Institute of Technology (ITAINNOVA), María de Luna, 7-8, Zaragoza, Spain

Tourism of Aragon, Avda. Ranillas, 3A, 3rd floor, Office 3D, Zaragoza, Spain

This paper describes the background and architecture of the Intelligent Radar for Aragonese Tourism (RITA), a data-driven social media surveillance system that aims to enhance the tourist experience by helping the Aragonese government to make informed, data-driven decisions, thereby empowering its tourism sector. The system is built on a customizable platform that integrates multiple data mining techniques to collect, clean, process and extract explicit and implicit knowledge from various sources such as social media networks, web pages, RSS feeds and structured data files. RITA employs state-of-the-art Natural Language Processing technologies combined with data analysis and modelling techniques to analyse the social perception of the region and link that information with organizational data. The platform integrates pre-trained and fine-tuned language models based on transformer architectures for solving different NLP tasks, including opinion and emotion analysis, semantic classification and entity recognition. The knowledge gathered is made available to tourism professionals via an interactive and customizable web application.

Keywords: Natural Language Processing, Transformers, Social Media, Tourism

1. Introduction

Social media networks are an integral part of daily human life and have positioned themselves as one of the most popular forms of communication, entertainment and social connection [1]. People generate and share almost any type of content regarding their opinions, interests, experiences and desires, which constitutes a huge and invaluable source of data and knowledge about human behaviour. The tourism domain may benefit greatly from the amount of information shared publicly among citizens if it is provided with appropriate data intelligence tools to discover first-hand and analyse the tastes, inclinations and concerns of tourists and other interested groups of people ([2], [3], [4]).

Multiple Natural Language Processing (NLP) challenges, such as multilingualism, the use of formal and informal expressions, ambiguity and others, must be faced when using social media data for human-related research. These challenges, coupled with the multimodal nature of user-generated data, may be better addressed by combining cutting-edge techniques with conventional methods. In this context, ITAINNOVA has re-designed and evolved the solution showcased in [5], which was conceived as a unified self-monitoring system for a particular user: a place where any individual could stay updated about their virtual social environment by thoroughly analysing the virtual interactions and extracting implicit knowledge from them. The capability to extract and organize valuable information from social networks is primarily enabled by the use of various natural language processing techniques, such as semantic categorization, entity extraction and opinion inference. Moving the focus from a single user to a more professional setting makes the system applicable as a working tool for any public or private corporation to empower its decision-making processes. The main objective pursued is to develop a multimodal data-driven social intelligent platform.

Both private companies and public administrations are recognizing the imperative to integrate digitalization and artificial intelligence (AI) tools into their organizational processes. Given the significance of the tourism sector in Aragón and the desire to enhance its robustness, as well as to gain profound insights into its users and potential visitors, Aragonese Tourism embarked on a collaborative endeavour with ITAINNOVA to develop a decision support tool based on the data-driven social platform. The primary objective of this system is to aggregate citizens' information from diverse sources and modalities. While surveys and traditional statistical analysis offer precise and valuable data, it is essential to acknowledge that individuals convey a wealth of information through spoken and written language, especially through open social media platforms, which requires bringing advanced AI and NLP techniques into the decision-making process [6].

RITA emerged from the need to better understand the needs and expectations of the tourism sector in Aragon.

In this paper, the social media monitoring system for Aragonese tourism is presented with the following structure: after the introduction, section 2 outlines the general platform that has been designed. Section 3 presents the final system developed on top of the social media platform. The last section concludes with an overview of the main outcomes achieved, as well as suggestions for further improvements.

2. Social media platform

ITAINNOVA has designed a highly scalable and customizable system, named the Social Media platform, that aims to fit the needs of current social and marketing research. The long-term objective is for it to serve as a base architecture that integrates heterogeneous data sources, which are curated, fused and modelled by means of advanced artificial intelligence techniques, deploying a wide range of smart services and allowing the development of data resources including datasets and in-domain knowledge models (large language models, multimodal models and data models). The operational architecture is shown in figure 1.

Figure 1: Operational architecture.

This platform is designed with the vision of serving as a decision support tool and as a domain knowledge repository. The core of the system corresponds to the block identified as Social-Media together with all the data and model repositories displayed in the lower part. Social-Media implements and exposes a series of microservices which are consumed by the client module (“Client API”), responsible for adapting and executing the customized processes for the specific domain.

As a general guideline, the platform is designed around the main pillars of data mining: gather useful information; process, clean and obtain relevant data; analyse and model the explicit and implicit knowledge contained in the data; and disseminate results. An overview of the implementation of these steps is presented in this section.

2.1. Information retrieval

The first step in the development is to properly establish the focus of the information to be retrieved and analysed, i.e. to determine which information sources will be included in the system, the specific domain of the information to be retrieved, the data types supported and the feasibility of their inclusion.

The current state of development of the platform supports the intake of textual and numerical data from diverse sources: social media networks, web pages, RSS feeds and structured data files. The primary social network integrated is Twitter, which exposes a public API composed of a set of endpoints that make it possible to explore and manage Twitter entities, depending on the access level authorized. A monitoring system will typically require tracking of the most recent events in the domain under research; thus, by default, standard endpoints will be queried.

In order to store all this information, a homogenized data model has been designed on a document-oriented database engine. This model is composed of two different entities: the first one, called “social-networks”, structures and stores the information on publications from the sources described above (data and metadata), with the differing concepts of these sources homogenized; the second one, called “users”, links the public information of the authors of the publications.

2.2. No-NLP: Not only NLP

The following phases of data cleaning, processing and modelling are crucial in order to obtain useful information, implicit knowledge, patterns and trends from large amounts of data. At this stage, two main approaches are considered: first, applying different state-of-the-art algorithms to the core data (the text interactions), and then exploiting relations with the rest of the analytical data to deduce higher-level associations. Therefore, primarily transformer-based NLP techniques have been applied to analyse textual content [7], identify salient elements and infer information at the semantic level, as described below.

After the extraction of social media content, a pre-filtering step is performed based on the keywords configured for searching and the metadata retrieved, applying specific domain rules. Over the most relevant candidates obtained from this filtering process, a data processing pipeline is applied for extracting structured and meaningful information, as depicted in figure 2.

Figure 2: Information processing.

Opinion analysis tries to identify and extract subjective information expressed in human linguistic production (spoken and written), that is, it tries to determine the attitude of the interlocutor with respect to the general topic or particular aspects of a message. This attitude can be classified in different ways; in this case a three-level categorical evaluation has been established: positive, negative and neutral. A multilingual XLM-RoBERTa model fine-tuned for Twitter sentiment analysis [8] has been integrated in the system for this submodule.

Emotion analysis: alongside the previous analysis, it is feasible to perform an analysis of the emotions and feelings expressed over a topic in written and verbal communication [9]. In this case, the model has to distinguish more precise sentiments and assign one or several corresponding tags to the documents. The tags selected are: like, love, haha, wow, sad and angry, inspired by Facebook’s reaction options. A Spanish multilabel dataset, consisting of more than two thousand documents retrieved from social networks in the past, was used to fine-tune the multilingual BERT [10] developed for this module.

Named-Entity Recognition (NER) consists of automatically extracting from the textual sequence a set of terms of interest referring to a given concept (entity). Depending on the working domain, the entities defined may vary, but for a general perspective, a standard NER on places, organizations and persons is applied. An already fine-tuned multilingual BERT entity recognizer is integrated in the system. In this task, both the sentence’s grammar and lexical ambiguity must be taken into account in order to get valid results. To overcome possible mistakes produced at the lexical level, the well-established approach based on gazetteers is still applied at the end of the NER processing, making use of the internal and external domain knowledge available.

Semantic classification refers to the task of assigning one or more predefined categories to a statement according to its overall meaning and the subject to which it refers. This classification makes it possible to group documents and organize large amounts of information according to their semantic interpretation. In the first user-oriented version of the monitoring system, ontology-based strategies were explored and implemented. They allow collecting terms and concepts from specific domains in a structured way that can be explored during text analysis by means of different algorithms, enabling the assignment of a conceptual category to a document or the profiling of users [11]. The results obtained were quite reasonable, although this approach required a periodic update of the knowledge bases used (ontologies, dictionaries, thesauri, lists, etc.) as the data sources change over time. To improve this process, Transformers, which exploit the concept of transfer learning to the maximum [12], were studied and tested, finally including a multilingual zero-shot classifier fine-tuned on the XLM-RoBERTa-Large pre-trained model [13] with a combination of data from 15 languages, such as English, French, Spanish, German, Hindi, etc.

Geocoding: the functionalities of the platform include a geolocation service that allows finding the information associated with a location, expressed through its coordinates or through its usual name (e.g. Sos del Rey Católico). This component is implemented through queries to the open source service Nominatim, which makes use of OpenStreetMap data. The names of the locations to be geo-positioned are retrieved from the mentions in the textual publications by invoking the NER service.

Further data analysis is applied at the end of the data processing pipeline. Statistical analysis of the numerical data extracted from the social networks is performed by computing standard metrics including means, deviations, comparisons among different variables, sample sizes, text frequencies, common patterns and so forth. Additionally, these techniques are applied over different cross-data sets as well, for example the set obtained by joining the number of documents which speak about a topic with the opinion the user is expressing about that topic.

2.3. Data exploitation

The results produced during the previous stages should be analysed and shared with the interested parties. The Social-Media platform is designed to be highly flexible by exposing a set of web services through a simple REST API that can be invoked from any compatible client, which makes it possible to integrate the results seamlessly in a web application, in external business intelligence systems or in client reports, depending on the use case defined by the user.

3. RITA

The acronym RITA comes from its Spanish initials: Radar Inteligente de Turismo de Aragón (Intelligent Radar for Aragonese Tourism). It stems from the collaboration between the Aragon Institute of Technology (ITAINNOVA) and the Society for the Promotion and Management of Aragonese Tourism (Aragonese Tourism organization from now on). Both entities are cooperating on researching and executing new innovative tourism strategies based on digitalization for the development and improvement of the tourist sector in their region. Listening to social needs and tastes directly and in a non-intrusive way, letting people express themselves freely and comfortably from any place, can only be achieved by consulting public intercommunication spaces such as the Internet and social media networks. Fusing that information with data retrieved from tourism offices, visiting reports from touristic places and open data knowledge would lead experts to comprehend thoroughly which aspects influence the current touristic panorama and which ones should be attended to, in order to then design and offer new tourist experiences. With this aim, the Aragonese Tourism corporation and ITAINNOVA began the design and development of their intelligent radar.

3.1. Tourism domain

For the use case and at the current phase, two information sources have been considered: social networks (Twitter) and web pages (blog posts, RSS feeds).

Pre-filtering and post-filtering stages are adapted to this particular domain: content related to Aragonese towns, regions and touristic places. In this sense, metadata is examined in order to check whether the query has returned mistaken publications and to discard the data in that case. Moreover, the locations identified by NER and geolocated are matched against the configured searches, and those that appear clearly out of scope are filtered out.

Opinion and emotion analysis are maintained as explained in section 2, while the NER results are curated in order to extract more precise results. Entities extracted for each class are automatically reviewed by applying some in-domain rules devised by tourism specialists: places identified are geolocated and matched with information on villages and regions; likewise, people's names retrieved are examined in detail, as it was found that a number of surnames match location names in Spanish; moreover, all the entities are filtered in terms of character length and social-network-related stopwords in order to discard meaningless information.

Semantic classification is particularly customized for the use case. Two different criteria are considered to structure the content based on its meaning: products and profiles. The group of products identifies the typology of the tourist offer mentioned; products refer to a combination of places, festivities, activities, natural resources, material and immaterial attractions, etc. The following products are defined within RITA: active tourism, rural tourism, popular festivities, culture and gastronomy. Profiles refer to the characteristics of interest that reflect the attitude and aspects that tourists demand when visiting different places in the autonomous community. The technical experts from the Aragonese Tourism corporation have selected these labels: safety, comfort, treasury and quality. The zero-shot classifier fits these classification needs with ease since, to our knowledge, there was no public dataset, multilingual or Spanish, available to fine-tune a language model classifier, and the labels considered may vary depending on the context.

3.2. Web interface

A web application has been designed and published, which includes a dashboard displaying all the processed information and different interactive components that allow filtering the results to visualize the desired contents. It includes a timeline of publications that can be sorted by date, relevance, number of “likes” and number of retweets. On one side there is a set of filters that can be applied: dates, sources, opinion, categories, profiles and tags (regions and municipalities searched in social networks), as depicted in figure 3. On the other side, different statistics on the users who have published are displayed, along with several graphs showing the average opinion of the publications, the evaluation of the aspects considered from the tourist point of view (profile), the most used hashtags, the places from which the publications have been made and the places that have been mentioned, as well as the organizations and people detected in the texts in the form of word clouds (see figure 4).

The application is intended for technical and management staff of the Aragonese Tourism corporation, to be incorporated as a dynamic consulting tool with which they can explore news from their action area, generate periodical reports, obtain an overall vision of the typology and expectations of potential visitors and, ultimately, get real snapshots of the current situation of the sector in Aragon from more personal perspectives.
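As an illustration of the filtering and sorting behaviour described for the dashboard, the following minimal Python sketch operates on an in-memory list of processed publications. The `Publication` record, its field names and the `filter_timeline` and `opinion_share` helpers are assumptions made for this example; they do not reproduce the actual data model or REST API of the Social-Media platform.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Publication:
    """Hypothetical processed record; field names are illustrative only."""
    text: str
    source: str      # e.g. "twitter" or "rss"
    published: date
    opinion: str     # "positive", "negative" or "neutral"
    product: str     # e.g. "gastronomy", "rural tourism"
    likes: int = 0
    retweets: int = 0


def filter_timeline(pubs, *, source=None, opinion=None, product=None,
                    start=None, end=None, sort_by="published"):
    """Apply the optional filters, then sort in descending order of `sort_by`."""
    selected = [
        p for p in pubs
        if (source is None or p.source == source)
        and (opinion is None or p.opinion == opinion)
        and (product is None or p.product == product)
        and (start is None or p.published >= start)
        and (end is None or p.published <= end)
    ]
    return sorted(selected, key=lambda p: getattr(p, sort_by), reverse=True)


def opinion_share(pubs):
    """Return the fraction of publications per opinion label."""
    counts = Counter(p.opinion for p in pubs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Invented sample data, used only to exercise the helpers above.
sample = [
    Publication("Great ternasco in Teruel", "twitter", date(2023, 5, 1),
                "positive", "gastronomy", likes=12, retweets=3),
    Publication("Trail closed near Ordesa", "twitter", date(2023, 5, 3),
                "negative", "active tourism", likes=4, retweets=1),
    Publication("Weekend in a rural house", "rss", date(2023, 5, 2),
                "positive", "rural tourism", likes=7),
    Publication("Tapas route in Zaragoza", "twitter", date(2023, 5, 4),
                "positive", "gastronomy", likes=20, retweets=9),
]

top_gastronomy = filter_timeline(sample, source="twitter",
                                 product="gastronomy", sort_by="likes")
shares = opinion_share(sample)
```

Keeping the filters as optional keyword arguments mirrors the way a dashboard user combines only the criteria of interest; in a deployed system the same selection would more plausibly be pushed down to the document database as a query.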

4. Conclusions and future work

The Intelligent Radar for Aragonese Tourism, RITA, has been introduced in this paper. It is built on top of the Social-Media platform designed by ITAINNOVA, whose aim is to evolve into a multimodal data-driven social intelligent platform used to support and enhance any kind of advanced decision-making tool. The radar emerges as a digital solution to empower the tourist sector in Aragon, with the objectives of monitoring the social perception of the region and its touristic attractions, and of linking that information with organizational data by applying natural language processing and other state-of-the-art techniques to detect the current situation of the sector and enhance the tourist experience. It can be stated that the RITA solution enables the Aragonese government to make informed decisions based on data, which reduces the cost of data acquisition. The platform is capable of gathering valuable information from heterogeneous sources, focused on user-generated content, which provides a more comprehensive understanding of the tourism sector in Aragon. This data-driven approach empowers the government to make smart decisions that can improve the tourism experience and eventually promote the region’s growth.

These objectives are fulfilled by the use of a wide range of language technologies combined with data analysis and machine learning techniques. The platform integrates several pre-trained and fine-tuned language models mainly based on transformer architectures for solving different NLP tasks, such as opinion analysis, emotion analysis, named-entity recognition and semantic classification. It also relies on in-domain knowledge in the form of rule-based filters and gazetteers, to improve the insights retrieved from the machine learning processes.
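The interplay between model output and rule-based filters can be illustrated with a short Python sketch. It assumes the NER model returns `(text, label)` pairs with the labels `PER`, `LOC` and `ORG`; the gazetteer entries, the stopword list, the minimum length and the function name `curate_entities` are hypothetical choices for this example, and the relabelling rule is only one possible reading of the surname and location ambiguity, not the rules actually deployed in RITA.

```python
# Hypothetical in-domain resources; real gazetteers would be far larger.
LOCATION_GAZETTEER = {"teruel", "huesca", "zaragoza", "alquézar"}
SOCIAL_STOPWORDS = {"rt", "via", "follow", "dm"}
MIN_LENGTH = 3  # assumed threshold for discarding very short mentions


def curate_entities(entities):
    """Filter and relabel (text, label) pairs produced by an NER model.

    Rules applied, in order:
    1. drop mentions that are too short or are social-network stopwords;
    2. relabel a PER mention as LOC when it matches the location gazetteer,
       since some Spanish surnames coincide with place names;
    3. drop exact duplicates while preserving the original order.
    """
    curated, seen = [], set()
    for text, label in entities:
        key = text.strip().lower()
        if len(key) < MIN_LENGTH or key in SOCIAL_STOPWORDS:
            continue
        if label == "PER" and key in LOCATION_GAZETTEER:
            label = "LOC"
        if (key, label) not in seen:
            seen.add((key, label))
            curated.append((text.strip(), label))
    return curated


# Invented model output, used only to exercise the rules above.
raw = [("RT", "ORG"), ("Teruel", "PER"), ("Teruel", "LOC"),
       ("Goya", "PER"), ("via", "ORG"), ("ITAINNOVA", "ORG")]
result = curate_entities(raw)
```

Applying the rules after the model, rather than retraining it, keeps the domain knowledge editable by tourism specialists without touching the language model itself.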

Combining different data types with statistical analysis allows the system to capture implicit relations in the information that feeds it.

Nonetheless, there is still room for improvement and some lines of future work are considered: extending the number and modalities of data sources is the main target defined in the platform roadmap. Adding external and internal data would lay a foundation for incorporating advanced multimodal transformer models to enrich hidden pattern discovery and for gathering more data which, adequately analysed and modelled, will answer more complex questions for the tourism technical experts.

Some types of data under evaluation are public customer reviews, publicly accessible reports from the National Statistics Institute and other regional tourist activity reports. With the emergence of generative large language models [14], such as GPT, a new landscape opens up in data analysis and knowledge inference, which will be leveraged in following updates of the platform. On the end-user side, the web interface will add new specific dashboards to analyse more easily particular aspects such as the opinions and emotions associated with a certain region or product, temporal statistics and trends, or social interactions.

Acknowledgments

This work has been partially funded by the Department of Big Data and Cognitive Systems at the Technological Institute of Aragon, by IODIDE group of the Government of Aragon, grant number T1720R and by the European Regional Development Fund (ERDF).

References

[1] E. Ortiz-Ospina, M. Roser, The rise of social media, Our world in data (2023).

[2] A. Huertas Herrera, M. D. Toro-Manríquez, R. Soler Esteban, C. Lorenzo, M. V. Lencinas, G. Martínez Pastur, Social media reveal visitors’ interest in flora and fauna species of a forest region, Ecosystems and People 19 (2023) 2155248.

[3] R. Nunkoo, D. Gursoy, Y. K. Dwivedi, Effects of social media on residents’ attitudes to tourism: Conceptual framework and research propositions, Journal of Sustainable Tourism 31 (2023) 350–366.

[4] F. J. Lacarcel, R. Huete, Digital communication strategies used by private companies, entrepreneurs, and public entities to attract long-stay tourists: a review, International Entrepreneurship and Management Journal (2023) 1–18.

[5] R. Montanés, R. Aznar, S. Nogueras, P. Segura, R. Langarita, E. Meléndez, P. Pena, R. Del Hoyo, Monitorización de social media, Procesamiento del Lenguaje Natural 61 (2018) 177–180.

[6] N. A. Alghamdi, H. H. Al-Baity, Augmented analytics driven by ai: A digital transformation beyond business intelligence, Sensors 22 (2022) 8071.

[7] L. Tunstall, L. Von Werra, T. Wolf, Natural language processing with transformers, O’Reilly Media, Inc., 2022.

[8] F. Barbieri, L. E. Anke, J. Camacho-Collados, XLM-T: A multilingual language model toolkit for twitter, CoRR abs/2104.12250 (2021). URL: https://arxiv.org/abs/2104.12250. arXiv:2104.12250.

[9] F. A. Acheampong, H. Nunoo-Mensah, W. Chen, Transformer models for text-based emotion detection: a review of bert-based approaches, Artificial Intelligence Review (2021) 1–41.

[10] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[11] P. Peña, R. Del Hoyo, J. Vea-Murguía, C. González, S. Mayo, Collective knowledge ontology user profiling for twitter–automatic user profiling, in: 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), volume 1, IEEE, 2013, pp. 439–444.

[12] R. Qasim, W. H. Bangyal, M. A. Alqarni, A. Ali Almazroi, et al., A fine-tuned bert-based transfer learning approach for text classification, Journal of healthcare engineering 2022 (2022).

[13] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.

[14] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv:2303.18223 (2023).