
FD. A Platform for Monitoring Financial and Economic Information towards Alternative Investment Funds

José Antonio García-Díaz

José Antonio Miñarro-Giménez

Ángela Almela

Gema Alcaraz-Mármol

Gema.Alcaraz@uclm.es 0

María José Marín-Pérez

Francisco García-Sánchez

Rafael Valencia-García

valencia@um.es 1

0 Departamento de Filología Moderna, Universidad de Castilla La Mancha, 45071, España

1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100 Murcia, España

2 Facultad de Letras, Universidad de Murcia, Campus de la Merced, 30001 Murcia, España

For efficient financial asset management, it is necessary to select, process and analyze specific information on the Internet. However, the large volume of available information and the fact that most of it is stored in an unstructured way hinder this task. In this demo, we present Financial Dashboard (FD), a global platform to monitor financial assets on the Internet using Natural Language Processing and Semantic Web technologies focused on the Spanish language. The objective is to allow users to monitor financial data from a set of keywords, accounts and digital newspapers. FD compiles data periodically and annotates it with semantic information such as financial entities or sentiments. All the information is made available to users through a web dashboard composed of configurable and independent KPIs and through a REST API.

Keywords: Alternative Investment Funds, Sentiment Analysis, Semantic Web, Natural Language Processing

1. Introduction

To boost financial management and to improve the efficiency in the use of public and private economy-related resources, it is necessary to monitor the Internet in search of financial data. This process involves selecting, processing and analyzing global and local financial activity. Proper financial management helps companies and public authorities to identify risks and opportunities. However, this task is not an easy one. First, there is a huge amount of information on the Internet, which makes it challenging to handle, especially with real-time requirements. Second, most information is stored in an unstructured or semi-structured way, so it is hard to take advantage of it. Third, state-of-the-art technologies for Natural Language Processing (NLP) are mainly focused on the English language and have not been tested properly on Spanish texts.

FD (Financial Dashboard) is a global platform that eases the monitoring of financial assets on the Internet using NLP tools and Semantic Web technologies. In a nutshell, this tool allows managers of companies and technicians of public organizations to establish a set of keywords, social network accounts and digital newspapers to monitor. The system compiles the data from those sources periodically and extracts semantic information, including financial entities and sentiments. The system also scores each piece of information in order to determine its relationship with a set of objectives that can be defined beforehand by end-users. Finally, all information is made accessible through a web dashboard composed of configurable and independent Key Performance Indicators (KPIs) and through a REST API.

At the technological level, this platform makes use of NLP techniques based on state-of-the-art Large Language Models (LLMs) for extracting objective and subjective information from textual sources, and of Semantic Web technologies to map those concepts to a domain ontology.

2. Background information

In this section we briefly analyze similar tools and approaches for monitoring financial data on the Internet.

In [1] the authors describe a system procedure for measuring explicit and implicit linkages between large U.S. bank holding companies. In their methodology, the authors propose the usage of mixed-frequency regression techniques. This component provides bank supervisors with knowledge about when new assets need to be monitored. Besides, the authors demonstrate how variables concerning outcome can be applied to measure the extent to which firms are interconnected. Another related study is [2], in which the authors introduce a framework for quantitative investments and trading in financial markets. This framework is subdivided into four components to monitor global variables, including (1) quantitative investment trading, (2) financial risk monitoring, (3) economic situation, and (4) environmental risk monitoring.

These previous studies do not take social data into account. However, in [3] the authors monitor fine-grained housing rental prices in order to bring insights for fair housing policies. For this, they focus on housing rental websites in China. They consider features concerning the location, the neighborhood, the home structure or accessibility, among others. They use data from 2017 and 2018 and evaluate several classical machine-learning regression algorithms such as random forest, gradient boosting or support vector machines. Their results suggest that most of the generated models perform well and that the two most influential features are related to job opportunity and accessibility to health care services.

To the best of our knowledge, there are no studies focused on monitoring financial data from social networks such as Twitter considering texts written in Spanish.

3. System architecture

3.1. Layer 1. Data acquisition module

This project has two main data sources, namely, news from digital newspapers and publications from social media sites. On the one hand, news items are extracted using a custom web crawler. This crawler can filter news sites based on two strategies. The first strategy is to filter by URL using regular expressions. For example, it is possible to restrict the system to consider only pages whose URL contains /economia/. The second strategy is to filter using CSS filters, as some of the news sites include certain rules in the style to denote financial content. The content is then stored in Markdown format, so that some structural information of the news is kept. On the other hand, the social media items are extracted from Twitter. We use the Twitter API together with the UMUCorpusClassifier tool [4] to filter certain Twitter accounts of digital newspapers.

The stored data is also pre-processed in order to remove hyperlinks, mentions, and texts written in languages other than Spanish. Finally, every piece of information is geolocated with a latitude, a longitude and a radius. The radius makes it possible to tie an item to a specific region or city, or to a broader area such as an autonomous community or even a country. To calculate this position, we use different heuristics, such as looking for locations in the headline or the main text and then using a reverse geolocation utility to set the position on the map. If no information is found, we search for metadata and author information.
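As a minimal sketch of two text-level steps described for the data acquisition module, the following Python fragment filters URLs with a regular expression and strips hyperlinks and mentions from a post. The pattern names and the helper functions `is_financial_url` and `clean_text` are hypothetical and are not taken from the FD code base; CSS-based filtering and language filtering are omitted.

```python
import re

# Hypothetical URL rule: keep only pages whose URL contains /economia/.
ECONOMY_URL = re.compile(r"/economia/")

# Illustrative clean-up patterns; the platform's actual rules are not published here.
HYPERLINK = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def is_financial_url(url: str) -> bool:
    """Return True when the URL matches the economy-section pattern."""
    return ECONOMY_URL.search(url) is not None


def clean_text(text: str) -> str:
    """Remove hyperlinks and @mentions, then collapse leftover whitespace."""
    text = HYPERLINK.sub(" ", text)
    text = MENTION.sub(" ", text)
    return " ".join(text.split())


urls = [
    "https://example.com/economia/2023/nota.html",
    "https://example.com/deportes/nota.html",
]
kept = [u for u in urls if is_financial_url(u)]
cleaned = clean_text("El BCE sube los tipos https://t.co/abc vía @diario")
```

Keeping the URL rule as a compiled pattern means that each monitored newspaper could be given its own expression without changing the filtering function.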

... a new corpus of financial headlines and sites to conduct a targeted sentiment analysis. The goal was to extract the main economic target of the document and then the sentiment towards this target as well as the sentiments towards other companies and society in general. For this, we use two neural network models. The first one is trained on a Named Entity Recognition (NER) task and is capable of identifying the target. The second neural network is a multi-label document classification model that is able to capture the sentiments towards the three targets (main economic target, companies, and society) at once. For the latter model, different LLMs have been evaluated, including large and base models of BETO and MarIA, and also lightweight models based on distillation and multilingual LLMs.

The next module is focused on extracting entities and mapping them to a domain ontology. For this, we created a novel ontology that contains concepts related to different financial sectors including tourism, technologies, health, industry or energy, among others. Each concept in the ontology can be given a set of named entities that identify relevant companies and related actors. Each compiled piece of information is mapped to the ontology using semantic annotation based on an extended version of the Term Frequency-Inverse Document Frequency score (TF–IDF–e) [8]. This strategy is based on the TF–IDF measure, which calculates the frequency of the different terms (TF) and weights this information according to how informative each term is across the rest of the documents (IDF). Once we have obtained the TF–IDF for each of the terms of the ontology that appear explicitly in the texts, we calculate the weight of the terms that appear implicitly. For this, the extended TF–IDF takes into account the distance between each identified entity and the rest of the concepts in the ontology.

Once the sentiments and the semantic annotations have been obtained, the relevance of each piece of information is ranked according to the users' interests. This process makes it possible to prioritize some items over others.

3.3. Dashboard

The last module of the FD platform is a dashboard. This dashboard is built using progressive web technologies. It enables users to create multiple projects and dashboards. Each dashboard is composed of several configurable KPIs, with each KPI associated with a group of custom filters. These filters make it possible to set concepts from the ontology, time series, keywords, or the way in which data is to be visualized (e.g., word clouds, timelines, heatmaps, tables, etc.). Besides, each KPI can be attached to specific filters, which makes it easier for end users to compare trends. It is also worth mentioning that the data from each KPI is accessible through a REST API too, so that the platform can be connected to external systems and tools.

Figure 2 presents a screenshot of the dashboard. As can be seen, the dashboard contains a generic filter for all KPIs. In the capture, this filter is configured to show data from the last six months for four topics: electricity, the European Central Bank, the rental price, and diesel oil. Below, the main KPIs are shown. Some of them are configured to show the sentiments and targets for the selected topics. Another KPI panel contains the number of documents per topic, and there are also KPI panels to show relevant documents, pie charts and a word cloud.

It is worth noting that the KPIs that are organized per target are based on the dataset published in the shared task FinancES 2023 [9], which consists of determining the main entity that appears in economic headlines and the sentiments towards this target, other companies and society in general. Targeted sentiment analysis can determine the polarity of certain texts towards different economic and social groups. This strategy distinguishes among three types of targets: (1) the main economic target (MET), (2) the rest of the companies, and (3) society and consumers. Besides, this approach can extract the main entity using a NER system.
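For reference, the classic TF–IDF weight on which the semantic annotation step of this section builds can be written as tf(t, d) · log(N / df(t)), where tf(t, d) is the relative frequency of term t in document d, N is the number of documents and df(t) is the number of documents containing t. The following Python sketch computes this baseline weight only. The function name and the toy documents are assumptions made for illustration, and the ontology-distance extension (TF–IDF–e) described in [8] is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Compute classic TF-IDF weights for every term of every tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times the log-scaled inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy tokenized headlines (illustrative data, not taken from the FD corpus).
docs = [
    ["banco", "sube", "tipos"],
    ["banco", "baja", "precio", "alquiler"],
    ["precio", "diesel", "sube"],
]
scores = tf_idf(docs)
```

In this toy collection, a term that occurs in a single headline, such as "tipos", receives a higher weight than a term shared by several headlines, such as "banco", which is the behaviour the IDF factor is meant to provide.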

4. Further work

In this work we have described the FD platform for monitoring economic and financial data on the Internet. This platform relies on Semantic Web technologies and NLP techniques for extracting, annotating and classifying financial data from several data sources, including web sites and social networks. The data is presented to the end-users in a web platform that allows them to configure a personalized dashboard with a set of configurable KPIs.

We are currently in the last stages of the development of the platform and are preparing several case studies for its validation. Further work will focus on improving the explainability of the neural network models. In particular, we plan to create a module based on linguistic features [10] and to define KPIs that highlight the parts of the text that contributed the most to the sentiment predictions. Another idea for improving the platform is to add more data filters.

Currently, we are working on incorporating information from video platforms such as YouTube. We will also focus on defining KPIs that cluster results per data source [11] and on increasing the number of available filters. Finally, we will evaluate the feasibility of incorporating KPIs for the detection of fake news.

5. Acknowledgments

This work is part of the research project AIInFunds (PDC2021-121112-I00) funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR.

References

[1] Hale, J. A. Lopez, Monitoring banking system connectedness with big data, Journal of Econometrics 212 (2019) 203–220. URL: https://www.sciencedirect.com/science/article/pii/S030440761930082X. doi:10.1016/j.jeconom.2019.04.027.

[2] Pan, Intelligent finance global monitoring and observatory: A new perspective for global macro beyond big data, in: 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS), 2019, pp. 623–628. doi:10.1109/ICPHYS.2019.8780156.

[3] Hu, He, Han, Xiao, Su, Weng, Cai, Monitoring housing rental prices based on social media: an integrated approach of machine-learning algorithms and hedonic modeling to inform equitable housing policies, Land Use Policy 82 (2019) 657–673. URL: https://www.sciencedirect.com/science/article/pii/S0264837718316429. doi:10.1016/j.landusepol.2018.12.030.

[4] J. A. García-Díaz, Á. Almela, G. Alcaraz-Mármol, R. Valencia-García, UMUCorpusClassifier: Compilation and evaluation of linguistic corpus for natural language processing tasks, Procesamiento del Lenguaje Natural 65 (2020) 139–142. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6292.

[5] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. Rodriguez-Penagos, A. Gonzalez-Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022) 39–60. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6405.

[6] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020, pp. 1–10.

[7] J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Smart analysis of economics sentiment in Spanish based on linguistic features and transformers, IEEE Access 11 (2023) 14211–14224. URL: https://doi.org/10.1109/ACCESS.2023.3244065. doi:10.1109/ACCESS.2023.3244065.

[8] M. Á. Rodríguez-García, R. Valencia-García, F. García-Sánchez, J. J. Samper-Zapater, Creating a semantically-enhanced cloud services environment through ontology evolution, Future Generation Computer Systems 32 (2014) 295–306. URL: https://www.sciencedirect.com/science/article/pii/S0167739X13001684. doi:10.1016/j.future.2013.08.003.

[9] J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Overview of FinancES 2023: Financial targeted sentiment analysis in Spanish (to appear), Procesamiento del Lenguaje Natural (2023).

[10] J. A. García-Díaz, P. J. Vivancos-Vicente, A. Almela, R. Valencia-García, UMUTextStats: A linguistic feature extraction tool for Spanish, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 6035–6044.

[11] J. A. García-Díaz, R. Colomo-Palacios, R. Valencia-García, Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians' tweets posted in 2020, Future Generation Computer Systems 130 (2022) 59–74. doi:10.1016/j.future.2021.12.011.