=Paper= {{Paper |id=Vol-3224/paper02 |storemode=property |title=LIVING-LANG: Living digital entities by human language technologies |pdfUrl=https://ceur-ws.org/Vol-3224/paper02.pdf |volume=Vol-3224 |authors=Luis Alfonso Ureña López,Estela Saquete,María Teresa Martín-Valdivia,Patricio Martínez-Barco |dblpUrl=https://dblp.org/rec/conf/sepln/LopezSMM22 }} ==LIVING-LANG: Living digital entities by human language technologies== https://ceur-ws.org/Vol-3224/paper02.pdf
LIVING-LANG: Living digital entities by human language
technologies
LIVING-LANG: Tecnologías del lenguaje humano para entidades digitales vivas

L. Alfonso Ureña-López (1), Estela Saquete (2), María-Teresa Martín-Valdivia (1) and Patricio Martínez Barco (2)
(1) Computer Science Department, SINAI, CEATIC, Universidad de Jaén, Campus Las Lagunillas, 23071, Jaén, Spain
(2) Department of Software and Computing Systems, University of Alicante, Spain


Abstract
This project pursues the dynamic modelling, at a spatial-temporal level, of digital entities in social media in order to predict their behaviour. Firstly, digital entities are modelled by identifying the characteristics of individuals through their language and footprint on the network. Then, the extraction of relationships between digital entities is one of the core challenges of the project. The proposal pursues this objective on a semantic level, structuring the information into representations of knowledge suitable for logical processing. Considering the heterogeneous nature of the sources to be dealt with, filtering of information is fundamental, using metrics and quality criteria. This spatial-temporal characterization, together with screening processes, will allow us to study high-performance predictive strategies in the evolution of digital entities. This project is coordinated by the SINAI and GPLSI research groups.

Keywords
Natural Language Processing, Sentiment Analysis, Emotion Mining, Sentiment Enrichment.



SEPLN-PD 2022. Annual Conference of the Spanish Association for Natural Language Processing 2022: Projects and Demonstrations, September 21-23, 2022, A Coruña, Spain
Email: laurena@ujaen.es (L. A. Ureña-López); stela@dlsi.ua.es (E. Saquete); laurena@ujaen.es (M. Martín-Valdivia); patricio@dlsi.ua.es (P. M. Barco)
ORCID: 0000-0001-7540-4059 (L. A. Ureña-López); 0000-0002-6001-5461 (E. Saquete); 0000-0002-6001-5461 (M. Martín-Valdivia); 0000-0002-6001-5461 (P. M. Barco)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction

Human language is the result of human social evolution, and thanks to it we can conceptualize reality, generating abstractions of it at different levels of complexity, which has given us a great capacity for reasoning. It has also enabled the organisation of complex social structures that have passed on culture and knowledge generation after generation through the use of a common language [1]. Language determines the way in which we relate to one another and, according to some authors, even how we think about and conceive the reality in which we live [2]. In this way, language becomes a very valuable resource for the cognitive modelling of an individual as studied in psycholinguistics [3], but also for understanding social interactions and communities in what is known as Computational Social Sciences [4]. This emerging discipline is fuelled by the arrival of great volumes of information, primarily from the social web. We exchange a vast amount of information on the web. At the same time, our habits regarding information consumption are at a critical time of transformation. Digital media, as the preferred source of information, already threatens the traditional written press. Young people choose social networks as their means of communication. Furthermore, this change in habit does not only affect the format or the means where the information is found; we are also changing the speed and type of content. According to Turkle [5], we have gone from “I think, therefore I am” to “I share, therefore I am”, reducing the quality of our “conversations” and, at the same time, creating the vague illusion of never being alone, referred to by the term “echo chamber”. Technology also implies changes in the way we act. An example would be the way in which we read [6] [7]. When we read digital media we “scan” rather than read. Short and simple content is almost the only element of consumption (titles, captions, highlighted sentences...) [8], and we are often carried away by our emotions when we decide what to read or where we read it. There are new challenges in this new digital paradigm that must be dealt with, derived from our inability to adapt to this new scenario and often resulting in the deterioration of our cognitive abilities [9] [10]. This enormous amount of information and digital connectivity entails the development of a technology capable of modelling the new paradigm, as well as determining the relationships that arise, their evolution in time and the ability to interfere with or predict their behaviour in the future.

Our previous project set out to identify digital entities, considered to be any entity in the real world (people, companies, organisations, tourist attractions...) with presence in the digital world and from which we can obtain a complete profile from their activity in such an environment. This profile is generated by processing unstructured content (web pages, articles, comments...) using human language technologies. However, the present “digital” situation requires us to go a step further and attempt to answer the following questions: a) How can we ensure the social contextualization of these entities, and model situations that change from day to day? b) How can we deduce new semantic relationships between entities? c) How can we guarantee that captured knowledge is real and corroborated by multiple sources? d) How can we guarantee the coexistence of knowledge in the long term?

This project aims to take this several steps further. In this way and thanks to these characteristics, we can establish relationships between the entities from a social and human perspective, improve the comprehension of the content exchanged, create new knowledge in the analysis of these relational structures and, eventually, characterise and predict these networks between entities on a human language level by using the temporal dimension, behaviours or phenomena.

This ability to understand language, model it and analyse its changes in time will allow us to face new challenges in the digital society in which we live. By measuring the veracity and credibility of the relationships extracted, we can confront phenomena such as fake news, defined as a deliberate distortion of a reality with the objective of creating and shaping public opinion and influencing social attitudes. Thanks to this project, tasks such as fact-checking, the automatic detection of ideological or confirmation bias, and the detection of clickbait can be handled automatically, as well as other post-truth problems that are difficult to detect and treat because of their “viral” content. Furthermore, language modelling and new knowledge about these dynamic relationships and their evolution over time will allow us, through the application of diverse techniques, to identify new characteristics and make inferences that provide predictions of future behaviours of digital entities. These predictions can be used for the early detection of problems associated with violence, mental health problems such as suicides, inappropriate behaviours and other security and health risks. For example, a change in the pattern of language used in the communication between two people can help detect the start of practices such as sexual harassment, when language moves from a suggestive, captivating or friendly language to that of a coercive or threatening nature. As shown in Figure 1, the relationships between entities are dynamic and change with time, as do their properties. By identifying these variations and their patterns based on human language, we can prepare these networks for the future by creating predictive models of people’s behaviour (risk detection, prevention of cyberbullying, terrorist warnings, etc.).

2. Objectives

The project started in 2018 and will be completed in 2022, and it involves a number of specific challenges and objectives of the overall project in the field of NLP research, which are detailed below:

OBJ1. Generation of the human language models used by digital entities through recognition of their primary characteristics (linguistic, cognitive, social, cultural and emotional) and independent of the domains and scenarios in which they act.

OBJ2. Use of the knowledge generated by digital entities and discovery of the semantic relationships between them. All available sources of information (unstructured, structured and open linked data), extraction mechanisms, identity enrichment, and other inference mechanisms will be taken into account. This will enable the integration of information related to an identity, determining the roles and properties associated with a space-time framework. It also enables the definition of relationships between identities using dynamic aspects such as context, temporary nature or importance.

OBJ3. Use of knowledge of relationships to determine the coherence, quality and contrast of the semantic relationships extracted. For this, we will use veracity assessment techniques, emotion analysis and subjectivity, as well as the detection of bias in the information to guarantee and corroborate the information that arises from the relationship.

OBJ4. Prediction of future behaviour of digital entities by discovering potential future semantic relationships between them, through the analysis of pre-existing networks and based on previously detected relationships.
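OBJ4 frames prediction as the discovery of potential future relationships from a network of previously detected ones. As a purely illustrative sketch, and not the method used by the project, this idea can be approximated with a classic neighbourhood-overlap (Jaccard) link-prediction heuristic: two entities that are not yet related but share many related entities are ranked as likely future relationships. The function name `jaccard_link_scores`, the toy entity names and the edge list below are all assumptions made for the example.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score each currently unlinked entity pair by Jaccard neighbourhood overlap.

    `edges` is an iterable of (entity_a, entity_b) pairs representing
    relationships already detected (undirected). This is a hypothetical
    helper for illustration, not part of the LIVING-LANG software.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v in neighbours[u]:
            continue  # the relationship already exists, nothing to predict
        shared = neighbours[u] & neighbours[v]
        if shared:
            union = neighbours[u] | neighbours[v]
            scores[(u, v)] = len(shared) / len(union)
    return scores


# Toy snapshot of a relationship network (entity names are invented).
edges = [("ana", "ben"), ("ana", "cruz"), ("ben", "cruz"), ("cruz", "dani")]
scores = jaccard_link_scores(edges)

# Rank candidate future relationships from most to least likely.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for pair, score in ranked:
    print(pair, round(score, 2))
```

In a temporal setting, the same scoring could be applied to successive snapshots of the network so that changes in the ranking over time become a signal in themselves; richer predictors would also use the linguistic and emotional properties attached to each relationship.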
Figure 1: Detection and monitoring of digital entities – Representation of an evolving environment over time

In summary, this project contributes to the Spanish national Plan for the Promotion of Human Language Technologies, which has aimed to promote the development of natural language processing since 2015.

To achieve the above global objective and the specific objectives of the global project, the coordination of two complementary sub-projects is proposed, whose specific objectives will cover the global objectives proposed, and whose reunification will provide the added value sought by the coordination.

3. Results and conclusions

This section describes the most significant results of the project.

Results regarding OBJ1: In this project, the domains to be worked on are mainly health and education, as well as the following scenarios: fake news, knowledge extraction, violence and hate speech [11, 12], studying the characteristics of the different scenarios in order to model the language in each of them. Resources associated with the different scenarios and domains defined have been created and used to train machine learning systems.

Results regarding OBJ2: The project has worked on various techniques for knowledge extraction in the different domains and scenarios defined, as well as on the organisation of workshops such as eHealth-KD 2020 to model human language in health documents in Spanish [13]. In addition, knowledge discovery techniques are being applied to the health domain [14, 15]. Furthermore, work has been done on the discovery of temporal information to enrich the entities by automatically extracting timelines from the documents and generating summaries from these timelines [16].

Results regarding OBJ3: In relation to this objective, a systematic study of the state of the art in this matter has been carried out [17] and, based on this study, work has been done to determine both
the veracity of the news and its parts and to study the detection of satire, resulting in an architecture that achieves 74% accuracy [18]. Within this task, progress has been made in the detection of incongruent headlines as well as in fact-checking tasks, as part of the disinformation detection architecture [19]. In addition, work has been done on emotion detection [20] [21] and negation [22].

Results regarding OBJ4: Regarding this objective, the project focused on the discovery of virality patterns, applying opinion mining techniques that enable us to structure the information based on the polarity of the messages and the emotions they contain [23]. After transforming the information from an unstructured textual representation to a structured one, association rule mining was used, leading to the conclusion that messages with a high-negative polarity and a very high emotional charge, especially emotions that have intensified with the COVID-19 pandemic, such as fear, sadness, anger and surprise, are more likely to go viral in social media.

All publications related to the project can be found on the project website: https://livinglang.gplsi.es/

4. Acknowledgments

This research work is funded by MCIN/AEI/10.13039/501100011033 and, as appropriate, by “ERDF A way of making Europe”, by the “European Union” or by the “European Union NextGenerationEU/PRTR” through the grant LIVING-LANG Project (RTI2018-094653-B-C21 / C22). It is a coordinated project with SINAI and GPLSI as participating research groups. It is also funded by Generalitat Valenciana through the project NL4DISMIS: Natural Language Technologies for dealing with dis- and misinformation (CIPROM/2021/21).

References

[1] M. Tomasello, A natural history of human thinking, Harvard University Press, 2014.
[2] G. W. Grace, The linguistic construction of reality, Routledge, 2016.
[3] R. Rommetveit, Words, Meaning, and Messages: Theory and Experiments in Psycholinguistics, Academic Press, 2014.
[4] H. Wallach, Computational social science, Comput. Soc. Sci. 307 (2016).
[5] M. Arnd-Caddigan, Sherry Turkle: Alone together: Why we expect more from technology and less from each other, 2015.
[6] Y. Eshet, Thinking in the digital era: A revised model for digital literacy, Issues in informing science and information technology 9 (2012) 267–276.
[7] D. Salyer, Reading the web: Internet guided reading with young children, The Reading Teacher 69 (2015) 35–39.
[8] N. K. Hayles, How we read: Close, hyper, machine, ADE 150 (2010) 62–79.
[9] M. Bauerlein, The Dumbest Generation: How the Digital Age Stupefies Young Americans and Jeopardizes Our Future, Jeremy P. Tarcher / Penguin, New York, 2008.
[10] N. Carr, The shallows: What the Internet is doing to our brains, WW Norton & Company, 2020.
[11] F. M. P. del Arco, M. D. Molina-González, L. A. Ureña-López, M. T. Martín-Valdivia, Comparing pre-trained language models for Spanish hate speech detection, Expert Syst. Appl. 166 (2021) 114120. URL: https://doi.org/10.1016/j.eswa.2020.114120. doi:10.1016/j.eswa.2020.114120.
[12] F. M. P. del Arco, M. D. Molina-González, L. A. Ureña-López, M. T. Martín-Valdivia, Detecting misogyny and xenophobia in Spanish tweets using language technologies, ACM Trans. Internet Techn. 20 (2020) 12:1–12:19. URL: https://doi.org/10.1145/3369869. doi:10.1145/3369869.
[13] A. Piad-Morffis, Y. Gutiérrez, Y. Almeida-Cruz, R. Muñoz, A computational ecosystem to support eHealth knowledge discovery technologies in Spanish, J. Biomed. Informatics 109 (2020) 103517. URL: https://doi.org/10.1016/j.jbi.2020.103517. doi:10.1016/j.jbi.2020.103517.
[14] P. López-Úbeda, M. C. Díaz-Galiano, T. Martín-Noguerol, A. Luna, L. A. U. López, M. T. Martín-Valdivia, Automatic medical protocol classification using machine learning approaches, Comput. Methods Programs Biomed. 200 (2021) 105939. URL: https://doi.org/10.1016/j.cmpb.2021.105939. doi:10.1016/j.cmpb.2021.105939.
[15] P. López-Úbeda, M. C. Díaz-Galiano, T. Martín-Noguerol, A. Luna, L. A. U. López, M. T. Martín-Valdivia, COVID-19 detection in radiological text reports integrating entity recognition, Comput. Biol. Medicine 127 (2020) 104066. URL: https://doi.org/10.1016/j.compbiomed.2020.104066. doi:10.1016/j.compbiomed.2020.104066.
[16] C. Barros, E. Lloret, E. Saquete, B. Navarro-Colorado, Natsum: Narrative abstractive
summarization through cross-document timeline generation, Inform. Proc. Manag. 56 (2019) 1775–1793. URL: https://www.sciencedirect.com/science/article/pii/S0306457318305922. doi:10.1016/j.ipm.2019.02.010.
[17] E. Saquete, D. Tomás, P. Moreda, P. Martínez-Barco, M. Palomar, Fighting post-truth using natural language processing: A review and open challenges, Expert Syst. Appl. 141 (2020). URL: https://doi.org/10.1016/j.eswa.2019.112943. doi:10.1016/j.eswa.2019.112943.
[18] A. Bonet-Jover, A. Piad-Morffis, E. Saquete, P. Martínez-Barco, M. Á. G. Cumbreras, Exploiting discourse structure of traditional digital media to enhance automatic fake news detection, Expert Syst. Appl. 169 (2021) 114340. URL: https://doi.org/10.1016/j.eswa.2020.114340. doi:10.1016/j.eswa.2020.114340.
[19] R. Sepúlveda-Torres, M. E. Vicente, E. Saquete, E. Lloret, M. Palomar, Headlinestancechecker: Exploiting summarization to detect headline disinformation, J. Web Semant. 71 (2021) 100660. URL: https://doi.org/10.1016/j.websem.2021.100660. doi:10.1016/j.websem.2021.100660.
[20] L. Canales, C. Strapparava, E. Boldrini, P. Martínez-Barco, Intensional learning to efficiently build up automatically annotated emotion corpora, IEEE Trans. Affect. Comput. 11 (2020) 335–347. URL: https://doi.org/10.1109/TAFFC.2017.2764470. doi:10.1109/TAFFC.2017.2764470.
[21] L. Canales, W. Daelemans, E. Boldrini, P. Martínez-Barco, Emolabel: Semi-automatic methodology for emotion annotation of social media text, IEEE Trans. Affect. Comput., early access (2019) 1–1. doi:10.1109/TAFFC.2019.2927564.
[22] S. M. Jiménez-Zafra, R. Morante, M. T. Martín-Valdivia, L. A. Ureña-López, Corpora annotated with negation: An overview, Comput. Linguistics 46 (2020) 1–52. URL: https://doi.org/10.1162/coli_a_00371. doi:10.1162/coli_a_00371.
[23] E. Saquete, J. Zubcoff, Y. Gutiérrez, P. Martínez-Barco, J. Fernández, Why are some social-media contents more popular than others? Opinion and association rules mining applied to virality patterns discovery, Expert Syst. Appl. 197 (2022) 116676. URL: https://doi.org/10.1016/j.eswa.2022.116676. doi:10.1016/j.eswa.2022.116676.