
SEBD

A Multi-Perspective Approach for Risky User Identification in Social Networks

(Discussion Paper)

Antonio Pellicani (0, 1), Gianvito Pio (0, 1), Michelangelo Ceci (0, 1, 2)

0 Big Data Lab, National Interuniversity Consortium for Informatics (CINI), Via Volturno, 58, 00185 Roma, Italy

1 Dept. of Computer Science, University of Bari "Aldo Moro", Via E. Orabona, 4, 70125 Bari, Italy

2 Jožef Stefan Institute, Jamova Cesta 39, 1000 Ljubljana, Slovenia

2023


Social networks have become an integral part of modern communication, allowing people to connect and interact across the globe. However, they also bring along some negative phenomena, such as cyberbullying and social media addiction. As a result, monitoring user behavior and content has become essential to ensure a safe and responsible use of social networks. In this context, we recently proposed a novel system called SAIRUS, which we describe in this discussion paper. SAIRUS adopts three separate models to learn from multiple perspectives of social network data, namely the content posted by users, their relationships, and their spatial closeness. We compare the system performance with 13 competitors on two real-world datasets, demonstrating its superiority in identifying risky users and its usefulness as a tool for social network analysis.

Keywords: Social Network Analysis, User Risk Identification, Spatial Analysis

1. Introduction

Social networks enable people to connect and share news, opinions, and ideas through actions such as posting, liking, and following each other. This capability fosters the creation of relationships and facilitates engagement in discussions on diverse topics and events. The widespread use of social networks has stimulated extensive research by the scientific community, mainly based on the use of Social Network Analysis (SNA) processes to explore the relationships and information exchange among users in the network [1]. In this context, our goal is to analyze social networks and identify risky users who engage in harmful or illegal activities, such as drug selling or promotion, political or religious extremism, and discrimination against specific groups.

The identification of risky users is important for suspending suspicious accounts and preventing harmful behaviors in social network platforms. Many recent studies have focused on this area, including works on cyber-extremism and the identification of jihadist accounts [2, 3]. Methodologically, the identification of risky users can be approached as a node classification task, and existing methods can be generally categorized into three approaches: content-based, topology-based, and hybrid. Content-based approaches focus on analyzing user-generated content [4, 5], while topology-based approaches consider only user relationships (e.g., established through following, liking, or commenting actions) in the network [6, 7]. Finally, hybrid approaches combine the strengths of content-based and topology-based methods, making them particularly effective in classifying borderline users who may have a mix of both safe and unsafe content or relationships [8, 9]. Journalists are a well-known example of users who may fall into this category.

It is noteworthy that social networks have become popular due to the possibility of interacting with them using mobile devices, which also integrate geolocation mechanisms. However, there have only been a few early attempts to use the spatial dimension in the analysis of (social) network data [10, 11], and many of the existing general approaches are unable to take into account the information conveyed by the geographic locations of the users, which can implicitly define new relationships among them. To fill this gap, we proposed SAIRUS [12], which takes into account the content generated by users, their relationships in the network, and their geographical positions to identify risky users. SAIRUS fuses three node classification models, each learned from a different perspective, using a stacked generalization approach to obtain a more robust final model that also exploits the uncertainty of the predictions.

Unlike existing hybrid approaches that inject artificially-defined features related to a perspective into the other(s), SAIRUS allows a separate focus on each perspective and ultimately combines their contribution to learn a final classifier. Specifically, for the user-generated content, SAIRUS uses word embeddings to train two autoencoders specialized in identifying safe and risky users; for user relationships and spatial information, two separate embeddings based on the analysis of network data are extracted and two classifiers are trained on top. In the following section, we provide some details about such approaches adopted by SAIRUS.

2. The method SAIRUS

Before introducing SAIRUS, we provide a formal definition of a social network as a 4-tuple $\langle N, D, P, E \rangle$, where:

• $N = N_L \cup N_U$ (with $N_L \cap N_U = \emptyset$) is the set of users, either labeled ($N_L$) or unlabeled ($N_U$). Each labeled user is associated with the category safe or risky.

• $D$ is the set of textual documents produced by users, that is, the posts. Each document $d \in D$ is associated with a timestamp and a geographical location.

• $P \subseteq N \times D$ refers to the relationship between users and the textual content they produce or share, specifically the action of creating or posting a particular textual content.

• $E \subseteq N \times N$ represents the topology of the social network, determined by the connections established between users through social relationships, e.g., follows.
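As an illustration only, the four components of this definition could be held in a small Python structure such as the one below. All class, field, and method names are hypothetical and do not come from the SAIRUS implementation; users without a label are marked with `None`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    timestamp: float                 # e.g. seconds since the epoch (assumption)
    location: Tuple[float, float]    # (latitude, longitude) in degrees (assumption)


@dataclass
class SocialNetwork:
    # user -> "safe", "risky", or None for unlabeled users
    labels: Dict[str, Optional[str]]
    # doc_id -> document (the set of posts)
    documents: Dict[str, Document] = field(default_factory=dict)
    # (user, doc_id) pairs: who posted what
    posted: Set[Tuple[str, str]] = field(default_factory=set)
    # (user, user) pairs: social relationships such as "follows"
    relations: Set[Tuple[str, str]] = field(default_factory=set)

    def labeled_users(self) -> Set[str]:
        return {u for u, y in self.labels.items() if y is not None}

    def unlabeled_users(self) -> Set[str]:
        return {u for u, y in self.labels.items() if y is None}

    def documents_of(self, user: str) -> List[Document]:
        # Posts of one user, oldest first.
        docs = [self.documents[d] for (u, d) in self.posted if u == user]
        return sorted(docs, key=lambda doc: doc.timestamp)
```

Keeping labeled and unlabeled users in one dictionary makes the two subsets disjoint by construction.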

In Figure 1, the four key stages performed by SAIRUS are depicted: i) the semantic content analysis of the textual documents generated by users, ii) the network topology analysis of user relationships, iii) the analysis of spatial closeness among users, and iv) the model fusion. In the following subsections, we briefly detail each of them.

2.1. Semantic analysis of the user-generated content

The goal of this stage is to analyze the textual content produced by users and classify them as either safe or risky. It takes as input the set of textual documents and the set of relationships representing the link between users and the textual documents they posted. SAIRUS first applies standard Natural Language Processing (NLP) techniques such as tokenization, stopword removal, and stemming. Then, it concatenates all preprocessed documents posted by each user, taking into account the temporal order of the documents. This choice allows SAIRUS to implicitly capture the temporal evolution of the topics discussed by the user.
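The preprocessing and concatenation steps can be sketched as follows. This is a minimal illustration, not the pipeline used by SAIRUS: the stopword list is a toy example, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (both are assumptions).

```python
import re

# Toy stopword list (assumption); a real pipeline would use a full list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}


def naive_stem(token):
    # Crude suffix stripping, used here only as a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    # Tokenization, stopword removal, and stemming.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def user_document(posts):
    # posts: iterable of (timestamp, text) pairs for a single user.
    # Documents are concatenated in temporal order, oldest first.
    tokens = []
    for _, text in sorted(posts, key=lambda p: p[0]):
        tokens.extend(preprocess(text))
    return tokens
```

For example, `user_document([(2, "Dogs are barking"), (1, "The cat jumped")])` yields the tokens of the older post followed by those of the newer one.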

Subsequently, SAIRUS generates a $k_c$-dimensional feature vector for each user, by applying the Word2Vec embedding method on each word of the concatenated documents. Specifically, an embedding for each user is obtained by summing up the embeddings of the words composing their concatenated document, according to the additive compositionality property [13].
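The additive composition step amounts to an element-wise sum of word vectors. The sketch below assumes the word embeddings are already available as a plain dictionary (`word_vectors`, a hypothetical name) and that out-of-vocabulary words are skipped, which is an assumption rather than a documented choice of SAIRUS.

```python
def user_embedding(tokens, word_vectors, dim):
    # Sum the embeddings of all words in the user's concatenated document.
    total = [0.0] * dim
    for token in tokens:
        vec = word_vectors.get(token)
        if vec is None:
            continue  # out-of-vocabulary words are skipped (assumption)
        for i, value in enumerate(vec):
            total[i] += value
    return total


# Toy 2-dimensional embeddings, for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.5, 2.0]}
embedding = user_embedding(["cat", "dog", "cat", "unknown"], toy_vectors, dim=2)
```

Repeated words contribute once per occurrence, so frequently used terms weigh more in the user representation.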

In the final step, we focus on the labeled users $N_L$. We train two distinct one-class classifiers using stacked autoencoders: $AE_R$ for the vector representation of labeled risky users and $AE_S$ for the vector representation of labeled safe users. For each unlabeled user $u \in N_U$, we provide the corresponding vector representation to both autoencoders $AE_S$ and $AE_R$ and calculate the reconstruction errors $re_S(u)$ and $re_R(u)$. As a result, the semantic analysis of a user's textual content produces three outputs: i) the reconstruction error $re_S(u)$ obtained by the autoencoder $AE_S$, ii) the reconstruction error $re_R(u)$ obtained by the autoencoder $AE_R$, iii) the predicted label $l_c(u) \in \{S, R\}$ (safe or risky), computed according to the minimum error achieved by $AE_S$ and $AE_R$. These outputs are used in the model fusion phase (see Figure 1).
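The decision rule based on the two reconstruction errors can be sketched as below. The autoencoders are represented by arbitrary callables that map a vector to its reconstruction; the mean squared error and the tie-break in favor of the safe class are assumptions made for this illustration.

```python
def reconstruction_error(vector, reconstruct):
    # Mean squared error between a vector and its reconstruction (assumption).
    rebuilt = reconstruct(vector)
    return sum((a - b) ** 2 for a, b in zip(vector, rebuilt)) / len(vector)


def content_outputs(vector, reconstruct_safe, reconstruct_risky):
    # Returns the three outputs of the content perspective:
    # the two reconstruction errors and the label with the smaller error.
    err_safe = reconstruction_error(vector, reconstruct_safe)
    err_risky = reconstruction_error(vector, reconstruct_risky)
    label = "safe" if err_safe <= err_risky else "risky"  # ties go to "safe" (assumption)
    return err_safe, err_risky, label


# Toy stand-ins for trained autoencoders: each one always "reconstructs"
# the centroid of the class it was trained on.
def toy_safe_autoencoder(vector):
    return [0.0, 0.0]


def toy_risky_autoencoder(vector):
    return [1.0, 1.0]


outputs = content_outputs([0.9, 1.1], toy_safe_autoencoder, toy_risky_autoencoder)
```

In this toy case the vector lies close to the risky centroid, so the risky autoencoder reconstructs it with the lower error and the predicted label is risky.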

2.2. Analysis of the network of relationships

SAIRUS considers the topology of the social network by directly analyzing the adjacency matrix $A \in \mathbb{R}^{|N| \times |N|}$, where $A_{ij} = 1$ if $(u_i, u_j) \in E$, $A_{ij} = 0$ otherwise, and $u_i$ and $u_j$ are the $i$-th and the $j$-th user of the network, respectively. However, the analysis of adjacency matrices may lead to issues due to high dimensionality and sparseness, since each user usually tends to establish relationships with a very small percentage of the whole set of users.
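A minimal sketch of how such an adjacency matrix can be built from a list of user pairs, together with a density measure that makes the sparseness issue visible. Relationships are treated as directed here, which is an assumption; function names are illustrative.

```python
def adjacency_matrix(users, relations):
    # users: ordered list of user identifiers.
    # relations: iterable of (source, target) pairs, e.g. "source follows target".
    index = {user: i for i, user in enumerate(users)}
    matrix = [[0] * len(users) for _ in users]
    for source, target in relations:
        matrix[index[source]][index[target]] = 1
    return matrix


def density(matrix):
    # Fraction of non-zero entries; real social networks are typically very sparse.
    n = len(matrix)
    return sum(map(sum, matrix)) / (n * n)


toy_users = ["a", "b", "c"]
toy_matrix = adjacency_matrix(toy_users, [("a", "b"), ("c", "a")])
```

Each row is a feature vector with one entry per user in the network, which is why the representation grows quickly in size as the network grows.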

Many existing works rely on dimensionality reduction techniques to address the high dimensionality and sparseness problems. SAIRUS can work directly on the adjacency matrix $A \in \mathbb{R}^{|N| \times |N|}$, or on a transformed matrix $A' \in \mathbb{R}^{|N| \times k_r}$ resulting from the application of a dimensionality reduction technique to $A$, where $k_r$ is a user-defined parameter. Specifically, SAIRUS can exploit PCA, autoencoders, and Node2Vec, although other techniques can be easily plugged into the workflow.

A node classification model is finally trained using the entire set of labeled users $N_L$. In this phase, SAIRUS exploits tree-based classifiers, since they have been shown to perform well on classification problems in the semi-supervised scenario [14]. When provided with an unlabeled user $u \in N_U$, the learned decision tree returns the predicted label $l_r(u)$ and a confidence value $c_r(u)$, which is based on the purity of the training examples associated with the leaf node into which $u$ falls. The predicted label and confidence value are then used in the model fusion phase, as illustrated in Figure 1.
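One common way to derive a confidence value from leaf purity is to take the fraction of training examples in the leaf that belong to the majority class. The sketch below assumes this definition; the exact purity measure used by SAIRUS is not specified in this text.

```python
from collections import Counter


def leaf_prediction(leaf_labels):
    # leaf_labels: labels of the training examples that ended up in the leaf.
    # Returns the majority label and its relative frequency in the leaf.
    counts = Counter(leaf_labels)
    label, majority = counts.most_common(1)[0]
    return label, majority / len(leaf_labels)


label, confidence = leaf_prediction(["risky", "risky", "safe", "risky"])
```

A pure leaf yields a confidence of 1.0, while a leaf with an even split between the two classes yields 0.5.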

2.3. Analysis of the spatial closeness among users

Similar to the analysis of the network of relationships, the spatial analysis also exploits an adjacency matrix built from the social network. In this case, SAIRUS uses a weighted matrix $W \in \mathbb{R}^{|N| \times |N|}$, where $W_{ij} = sc(u_i, u_j)$ corresponds to the spatial closeness between the user $u_i$ and the user $u_j$. Specifically, $sc(u_i, u_j)$ is based on the geodetic distance $d(u_i, u_j)$ between the geographical locations of the users $u_i$ and $u_j$, which are estimated as the mode of the geographical locations associated with their posts on the social network. We standardize the distance $d(u_i, u_j)$ using the z-score normalization, obtaining $z(u_i, u_j)$, which allows us to distinguish two groups of user pairs: those who are spatially closer than the average (with $z(u_i, u_j) < 0$) and those who are spatially more distant than the average (with $z(u_i, u_j) \geq 0$). Accordingly, we calculate $sc(u_i, u_j)$ as follows:

$$
sc(u_i, u_j) =
\begin{cases}
\dfrac{z(u_i, u_j)}{z_{min}}, & \text{if } z(u_i, u_j) < 0 \\
0, & \text{otherwise}
\end{cases}
\qquad (1)
$$

where $z_{min}$ is the minimum of the normalized distances between two users. Note that we further normalize $z(u_i, u_j)$ over $z_{min}$ in order to obtain a value in the range [0, 1], where 0 means that the users $u_i$ and $u_j$ are very far from each other (actually, more than the average) and 1 means that $u_i$ and $u_j$ are located precisely at the same location.
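The computation of the spatial closeness can be sketched end to end as follows. This is an illustration under stated assumptions: the haversine great-circle formula is used as a spherical approximation of the geodetic distance, the population standard deviation is used for the z-score, and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def user_location(post_locations):
    # Mode of the (latitude, longitude) pairs attached to a user's posts.
    return Counter(post_locations).most_common(1)[0][0]


def haversine_km(p, q):
    # Great-circle distance in km (spherical approximation, assumption).
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def spatial_closeness(locations):
    # locations: user -> (latitude, longitude). Returns closeness per user pair.
    pairs = list(combinations(sorted(locations), 2))
    dist = {pair: haversine_km(locations[pair[0]], locations[pair[1]]) for pair in pairs}
    mu, sigma = mean(dist.values()), pstdev(dist.values())
    if sigma == 0:
        # All pairs are equally distant: no pair is closer than the average.
        return {pair: 0.0 for pair in pairs}
    z = {pair: (d - mu) / sigma for pair, d in dist.items()}
    z_min = min(z.values())  # negative whenever sigma > 0
    return {pair: (z[pair] / z_min if z[pair] < 0 else 0.0) for pair in pairs}


toy_locations = {"a": (41.1, 16.9), "b": (41.1, 16.9), "c": (46.05, 14.5)}
toy_closeness = spatial_closeness(toy_locations)
```

In the toy example, users `a` and `b` share the same position and form the closest pair, so their closeness is 1, while both pairs involving `c` are farther apart than the average and get 0.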

After computing the matrix $W$, we use a dimensionality reduction technique to obtain the reduced matrix $W' \in \mathbb{R}^{|N| \times k_s}$, where $k_s$ is a user-defined parameter. Then, we train a node classification model on the labeled users $N_L$. Similar to the approach used for the network of relationships, we use a decision tree learner, which provides a predicted label $l_s(u)$ and a confidence value $c_s(u)$ for any unlabeled user $u \in N_U$. These outputs are then used in the model fusion phase (see Figure 1).

2.4. Model Fusion

The aim of the last step is to combine the results of the models based on the textual content, the network topology, and the spatial dimension to classify the unlabeled users in $N_U$. In SAIRUS, we use a Multi-Layer Perceptron (MLP) model to perform this task, following the Stacked Generalization approach [15].

The chosen MLP architecture is depicted at the bottom of Figure 1. It has an input layer comprising 7 neurons, which considers the following inputs for a given user $u$: i) the reconstruction error values of the safe autoencoder, $re_S(u)$, and of the risky autoencoder, $re_R(u)$, along with the predicted label $l_c(u)$ derived from the semantic analysis component for the textual content; ii) the predicted label $l_r(u)$ and the confidence value $c_r(u)$ obtained from the component responsible for analyzing the network of relationships; iii) the predicted label $l_s(u)$ and the confidence value $c_s(u)$ obtained from the component responsible for the spatial analysis. We use the sigmoid activation function in the hidden layer to capture any non-linear relationships between the input and output variables. In contrast, we use the softmax activation function in the output layer for the final classification.
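The forward pass of such a fusion network can be sketched in plain Python as shown below. The sketch covers only inference with untrained random weights; the hidden layer size (4), the numeric encoding of labels (safe as 0, risky as 1), and all names are assumptions made for illustration, and training is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # One fully connected layer: weights has one row per output neuron.
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def fusion_features(content, relations, spatial):
    # Builds the 7 inputs: 3 from content, 2 from relationships, 2 from space.
    err_safe, err_risky, label_c = content
    label_r, conf_r = relations
    label_s, conf_s = spatial
    encode = {"safe": 0.0, "risky": 1.0}  # label encoding (assumption)
    return [err_safe, err_risky, encode[label_c],
            encode[label_r], conf_r, encode[label_s], conf_s]


def forward(features, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(v) for v in dense(features, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))  # [P(safe), P(risky)]


random.seed(0)
HIDDEN = 4  # hypothetical hidden layer size
w_hidden = [[random.uniform(-1, 1) for _ in range(7)] for _ in range(HIDDEN)]
b_hidden = [0.0] * HIDDEN
w_out = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(2)]
b_out = [0.0, 0.0]

features = fusion_features((0.2, 0.9, "safe"), ("risky", 0.75), ("safe", 0.6))
probabilities = forward(features, w_hidden, b_hidden, w_out, b_out)
```

Because the errors and confidences enter the network as ordinary inputs, the fusion weights are learned from data rather than fixed by hand.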

It is noteworthy that our approach, which uses the stacked generalization framework, does not require any user-defined criteria/weight to merge the outputs of the three distinct models. Moreover, in contrast to ensemble techniques that solely rely on combining predictions ($l_c(u)$, $l_r(u)$, and $l_s(u)$, in our case), SAIRUS can incorporate other features such as the reconstruction errors $re_S(u)$ and $re_R(u)$, and the prediction confidences $c_r(u)$ and $c_s(u)$, which make it more robust to the uncertainty of the predictions and to the possible presence of noise in the data.

3. Experiments

We collected a real-world dataset from Twitter to evaluate the performance of SAIRUS. Each tweet in the dataset was associated with a sentiment score, which was computed using the Stanford CoreNLP Toolkit and manually revised by three domain experts.

To label users as either risky or safe, two strategies were employed. The first strategy relied on identifying tweets containing specific keywords related to threats, terrorism, and hate against immigrants and women. The second strategy assigned a score to each user by summing the sentiment scores of their tweets. The assumption was that users with a higher number of negative sentiment tweets are more likely to be risky.

To ensure the accuracy of the labeling process, we initially labelled the top-ranked users as safe and the bottom-ranked users as risky, whose posts were also manually inspected by three expert reviewers. We also introduced a set of borderline users, who were initially classified as risky but had mostly safe connections, to introduce noisy data under controlled conditions. These users may correspond to journalists who share negative content for informational purposes, but have mostly connections with safe users. The resulting datasets consisted of 2241 safe users (including 263 borderline users) and 1467 risky users for the keyword strategy, and 2047 safe users (including 304 borderline users) and 1033 risky users for the sentiment strategy, with 11,659,043 and 13,970,379 tweets, respectively.

We assessed the performance of SAIRUS using PCA, Node2Vec, and autoencoders for dimensionality reduction. We also evaluated the results with different values of the embedding dimensionality, namely $k_c$ for the semantic analysis of the textual content, $k_r$ for the analysis of the network of relationships, and $k_s$ for the spatial analysis. After conducting some preliminary evaluations, we chose the following parameter combinations for the experiments: $\langle k_c = 128, k_r = 256, k_s = 256 \rangle$, $\langle k_c = 256, k_r = 128, k_s = 128 \rangle$, and $\langle k_c = 512, k_r = 128, k_s = 128 \rangle$. Due to space constraints, here we report the best results (all the results can be found in [12]).

We compared the performance of SAIRUS with several other methods, including a Random Forest model (RF) with 100 trees, and two one-class classifiers based on autoencoders (1C-AEs) designed for content-based analysis, which is consistent with the methodology used in SAIRUS. We used different feature sets, each focusing on one or more perspectives, such as content (C), relationships (R), or spatial (S). For multiple perspectives, we concatenated the feature sets of each single perspective (C+R, C+S, R+S, and C+R+S). To embed the textual content, we used state-of-the-art systems such as Word2Vec (w2v) and Doc2Vec (d2v), with embedding dimensionality set to $k_c$, which is the same as that used by SAIRUS. For the embedding of the network of relationships and of the spatial closeness, we used Node2Vec (n2v) with embedding dimensionality set to $k_r$ and $k_s$, respectively, following the setting adopted for SAIRUS.

We adopted a stratified 5-fold cross-validation technique, which preserved the proportion of safe and risky users, as well as the ratio of borderline users within safe users. Our evaluation metrics included precision, recall, F1-Score, and accuracy, with the positive class being the risky label. In addition, we computed these measures specifically on the borderline users to determine the performance of the methods in handling noisy data.
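The evaluation protocol can be sketched as follows. The fold assignment below is a simple round-robin within each stratum (for instance safe, borderline, and risky), which preserves the proportions across folds; it is an illustrative stand-in, not the procedure actually used in the experiments. The metrics treat the risky label as the positive class.

```python
def stratified_folds(strata, k=5):
    # strata: user -> stratum name, e.g. "safe", "borderline", "risky".
    # Users of each stratum are dealt round-robin across the k folds.
    folds = [[] for _ in range(k)]
    by_stratum = {}
    for user, stratum in strata.items():
        by_stratum.setdefault(stratum, []).append(user)
    for members in by_stratum.values():
        for i, user in enumerate(sorted(members)):
            folds[i % k].append(user)
    return folds


def binary_metrics(y_true, y_pred, positive="risky"):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


toy_true = ["risky", "risky", "safe", "safe", "risky"]
toy_pred = ["risky", "safe", "safe", "risky", "risky"]
toy_scores = binary_metrics(toy_true, toy_pred)
```

Computing the same measures on the borderline users alone only requires restricting `y_true` and `y_pred` to that subset.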

3.1. Results

In Tables 1 and 2, we show the results obtained on the sentiment dataset and on the keywords dataset, respectively, where we emphasize the best result obtained for a given evaluation measure. By looking at the competitor solutions solely based on textual content, we notice that the use of w2v generally leads to better results than d2v (as also observed in [16]). On the other hand, [...] one solution over the other. However, the adoption of features related to user relationships (R), to the spatial closeness (S), or a combination of these perspectives did not seem to provide a clear contribution to the competitors. This result confirms that simply injecting features coming from one perspective into the other could also compromise the classifier performance due to the possible introduction of issues related to the curse of dimensionality.

Table 1: Results on the sentiment dataset, with $k_c = 512$, $k_r = 128$, $k_s = 128$.

In contrast, SAIRUS achieved the best results when leveraging the network of user relationships or the spatial dimension (or both). This was particularly evident in the sentiment dataset, where the F1-score reached approximately 0.8 when both user relationships and the spatial analysis were considered. These results demonstrate that the fusion strategy adopted by SAIRUS is more effective than the concatenation of features. In the keywords dataset, the configuration that leveraged both textual content and spatial analysis slightly emerged as the best. These results confirmed the relevance of the spatial perspective and the importance of properly modeling and exploiting it through a smart fusion strategy. Moreover, the obtained results indicate that the spatial dimension is an important factor for predicting borderline users, regardless of the network representation used. In other words, incorporating spatial information improves the accuracy of predictions for borderline users.

SAIRUS outperformed competitors in both datasets, demonstrating its ability to effectively distinguish between safe and risky users in social networks, paving the way towards its adoption for the analysis of large amounts of data from geo-located mobile devices.

Table 2: Results on the keywords dataset, with $k_c = 128$, $k_r = 256$, $k_s = 256$.

4. Conclusion

This paper discussed SAIRUS, a novel approach for identifying risky users in social networks. By combining multiple perspectives of social network data, including textual content, user relationships, and spatial closeness, SAIRUS can accurately classify users, outperforming 13 competitor systems that exploit either one perspective at a time or a combination thereof. In our experiments, SAIRUS also proved to be robust to the presence of noisy users.

In addition to its current capabilities, SAIRUS has the potential to incorporate the temporal dimension related to textual content and detect sudden changes in user behavior. Therefore, future work will focus on extending SAIRUS to make it able to capture the dynamism of the network of relationships and spatial closeness among users, providing a more comprehensive risk assessment of social network users.

The authors acknowledge the support of the European Commission through the H2020 Project “CounteR - Privacy-First Situational Awareness Platform for Violent Terrorism and Crime Prediction, Counter Radicalisation and Citizen Protection” (Grant N. 101021607). This work was also partially supported by the project FAIR - Future AI Research (PE00000013), Spoke 6 Symbiotic AI, under the NRRP MUR program funded by the NextGenerationEU.

[1] S. Tabassum, F. S. Pereira, S. Fernandes, J. Gama, Social network analysis: An overview, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (2018).

[2] I. Awan, Cyber-Extremism: Isis and the Power of Social Media, Society 54 (2017) 138-149.

[3] A. Al-Rawi, J. Groshek, Jihadist Propaganda on Social Media: An Examination of ISIS Related Content on Twitter, Int. Journal of Cyber Warfare and Terrorism 8 (2018) 1-15.

[4] V. N. Uzel, E. Saraç Eşsiz, S. Ayşe Özel, Using fuzzy sets for detecting cyber terrorism and extremism in the text, in: ASYU 2018, 2018, pp. 1-4.

[5] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, 2014, pp. 1188-1196.

[6] S. A. Macskassy, F. Provost, Classification in networked data: A toolkit and a univariate case study, Journal of machine learning research 8 (2007) 935-983.

[7] M. Bilgic, L. Getoor, Effective label acquisition for collective classification, in: Proc. ACM SIGKDD 2008, KDD '08, ACM, 2008, pp. 43-51.

[8] M. Mateen, M. A. Iqbal, M. Aleem, M. A. Islam, A hybrid approach for spam detection for twitter, in: 2017 14th International Bhurban Conference on Applied Sciences and Technology (IBCAST), 2017, pp. 466-471.

[9] T. Hamdi, H. Slimi, I. Bounhas, Y. Slimani, A hybrid approach for fake news detection in twitter based on user features and graph embedding, in: Distributed Computing and Internet Technology, Springer International Publishing, Cham, 2020, pp. 266-280.

[10] R. Medina, G. Hepner, Geospatial analysis of dynamic terrorist networks, in: Values and violence, Springer, 2008, pp. 151-167.

[11] M. A. Masood, R. A. Abbasi, Using graph embedding and machine learning to identify rebels on twitter, Journal of Informetrics 15 (2021) 101121.

[12] A. Pellicani, G. Pio, D. Redavid, M. Ceci, SAIRUS: Spatially-aware identification of risky users in social networks, Information Fusion 92 (2023) 435-449.

[13] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, CoRR abs/1310.4546 (2013). arXiv:1310.4546.

[14] J. Levatic, D. Kocev, M. Ceci, S. Dzeroski, Semi-supervised trees for multi-target regression, Inf. Sci. 450 (2018) 109-127.

[15] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241-259.

[16] G. De Martino, G. Pio, M. Ceci, PRILJ: an efficient two-step method based on embedding and clustering for the identification of regularities in legal case judgments, Artificial Intelligence and Law 30 (2022) 359-390.