An investigation on Not Safe For Work adult content in Reddit (Discussion Paper)

Francesco Cauteruccio (1), Enrico Corradini (2), Giorgio Terracina (1), Domenico Ursino (2) and Luca Virgili (2)
(1) DEMACS, University of Calabria, Italy
(2) DII, Polytechnic University of Marche, Italy

Abstract
Reddit is one of the few social platforms that handles NSFW (Not Safe For Work) content in an explicit and well-structured way. Despite this, the topic has been largely neglected by researchers who have studied this social network. In this paper, we aim to contribute to this setting by proposing an approach to extract and analyze text patterns from NSFW content in Reddit. An important peculiarity of our approach is that patterns are extracted not only based on their frequency (as generally happens in the literature), but also, and especially, on one or more utility measures.

Keywords
Reddit, NSFW posts and comments, Text patterns, Pattern utility measures, Social Network Analysis

1. Introduction

The term “Not Safe For Work” (hereafter, NSFW) is used by many social media to refer to content within them that cannot be viewed in public or professional settings. The study of the phenomenon of NSFW content in social media has attracted many authors (e.g., [1]). One of the social platforms that has adopted the concept of NSFW in an explicit and well-structured way is Reddit. However, despite this, few researchers have investigated the features of NSFW content in Reddit [2, 3, 4]. In particular, in [4], the authors use Social Network Analysis to investigate NSFW posts on Reddit. They focus on the structural features of the posts without analyzing their content. In this paper, we continue the research efforts of [4] and, once again, use Social Network Analysis to study NSFW posts and comments on Reddit. However, here, we shift the research focus from structure to content.
More specifically, we propose an approach for extracting and analyzing text patterns present in NSFW adult content on Reddit. In our context, we use the term “pattern” to refer to a set of words in posts or comments that satisfy certain properties. Our approach consists of three main steps, namely: (i) Data Cleaning and Annotation, (ii) Pattern Extraction and Enrichment, and (iii) Network-based Pattern Analysis.

Figure 1: The general workflow of our approach. Data Cleaning and Annotation (removing bot content, cleaning text, sentiment annotation, lexical annotation); Pattern Extraction and Enrichment (frequent pattern mining, filtering patterns by utility, enriching patterns with related data information); Network-based Pattern Analysis (building of pattern-based networks and user-based networks, network analysis).

SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
Email: cauteruccio@mat.unical.it (F. Cauteruccio); e.corradini@pm.univpm.it (E. Corradini); terracina@mat.unical.it (G. Terracina); d.ursino@univpm.it (D. Ursino); l.virgili@pm.univpm.it (L. Virgili)
ORCID: 0000-0001-8400-1083 (F. Cauteruccio); 0000-0002-1140-4209 (E. Corradini); 0000-0002-3090-7223 (G. Terracina); 0000-0003-1360-8499 (D. Ursino); 0000-0003-1509-783X (L. Virgili)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Applying our approach on Reddit allowed us to make several contributions to this research scenario.
They involve: (i) discovering that traditional approaches to sentiment computation are unreliable in the context of NSFW adult content; (ii) defining and finding opinion leaders in real communities sharing NSFW adult content; (iii) discovering text patterns representing the building blocks of NSFW posts and comments on Reddit; (iv) determining new virtual communities of users sharing NSFW adult content; (v) identifying opinion leaders who could influence such communities.

The rest of this paper is structured as follows: Section 2 provides a general description of our approach and the dataset used for our experiments. Section 3 illustrates the Pattern Extraction and Enrichment step. Section 4 describes the Network-based Pattern Analysis step. Finally, Section 5 presents our conclusions and possible future developments of our research efforts.

2. General overview of our approach

The general workflow of our approach is shown in Figure 1, which highlights the three steps composing it (i.e., Data Cleaning and Annotation, Pattern Extraction and Enrichment, and Network-based Pattern Analysis).

The Data Cleaning and Annotation step removes irrelevant content and standardizes text representations. It also performs lexical (e.g., part-of-speech and named entities) and sentiment annotations. The latter capture the polarity of the sentiment expressed in each text as a compound score, computed by applying Vader [5]. Due to space limitations, we do not illustrate this step in detail in this paper.

The Pattern Extraction and Enrichment step extracts a set of text patterns from the posts and comments produced by the previous step; these patterns are the basis for the next Network-based Pattern Analysis step. To this end, it first extracts frequent patterns. Then, it associates each pattern with a rich set of features regarding the posts and comments it derives from, as well as the users who published them.
Afterwards, it defines some utility measures and associates the corresponding values with each pattern. Finally, it selects only those patterns with high frequency and high utility. Our approach allows the definition of different concepts and utility measures and, consequently, the selection of different sets of useful patterns based on them. This allows us to analyze the available NSFW content from very different perspectives, while adopting a uniform methodology.

The Network-based Pattern Analysis step applies the concepts and approaches of Social Network Analysis to the patterns obtained during the previous step, with the goal of extracting information and knowledge from them. Specifically, it constructs and uses three social networks, namely:

(i) User Interaction Network, in which a node $n_i$ represents a user $u_i$ who published at least one post or comment. An arc $(n_i, n_j, w_{ij})$ denotes that $u_i$ commented on a post of $u_j$; $w_{ij}$ indicates how many times this happened.

(ii) Pattern Network, in which a node $n_i$ represents a pattern $p_i$ extracted in the previous step. An arc $(n_i, n_j, w_{ij})$ indicates that $p_i$ and $p_j$ were adopted by at least one user in common; $w_{ij}$ indicates the number of users who adopted both $p_i$ and $p_j$.

(iii) User Content Network, in which a node $n_i$ represents a user $u_i$ who published at least one post or comment. An arc $(n_i, n_j, w_{ij})$ indicates that there is at least one comment posted by $u_i$ and at least one comment posted by $u_j$ containing the same pattern; $w_{ij}$ denotes the number of times this happened.

Once these networks are built, this step applies Social Network Analysis concepts and approaches to them in order to extract information and knowledge about the Reddit users who publish, comment on and read NSFW adult posts, and about the content these users exchange. Due to space limitations, in this paper, we focus only on the User Interaction Network.
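The three network definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the record layout, the user names and the pattern identifiers are hypothetical, and the weight of the User Content Network is assumed here to be the number of distinct patterns shared by two users, which is one possible reading of "the number of times this happened".

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy records: (commenter, author of the commented post,
# patterns found in the comment). Names and pattern ids are illustrative.
comments = [
    ("alice", "bob", {"p1", "p2"}),
    ("alice", "bob", {"p1"}),
    ("carol", "bob", {"p2"}),
    ("bob", "alice", {"p1", "p2"}),
]

def user_interaction_network(records):
    """Directed arcs (u_i, u_j) -> w_ij: times u_i commented on a post of u_j."""
    return Counter((commenter, author) for commenter, author, _ in records)

def pattern_network(records):
    """Undirected arcs (p_i, p_j) -> number of users who adopted both patterns."""
    adopted = defaultdict(set)  # user -> patterns used in any of their comments
    for commenter, _, patterns in records:
        adopted[commenter] |= patterns
    arcs = Counter()
    for patterns in adopted.values():
        for p_i, p_j in combinations(sorted(patterns), 2):
            arcs[(p_i, p_j)] += 1
    return arcs

def user_content_network(records):
    """Undirected arcs (u_i, u_j) -> number of distinct patterns that both
    users wrote in at least one comment (assumed reading of the weight)."""
    users_of = defaultdict(set)  # pattern -> users who wrote it
    for commenter, _, patterns in records:
        for pattern in patterns:
            users_of[pattern].add(commenter)
    arcs = Counter()
    for users in users_of.values():
        for u_i, u_j in combinations(sorted(users), 2):
            arcs[(u_i, u_j)] += 1
    return arcs

if __name__ == "__main__":
    print(user_interaction_network(comments))
    print(pattern_network(comments))
    print(user_content_network(comments))
```

Plain dictionaries of weighted arcs are used to keep the sketch dependency-free; a graph library could be substituted when metrics such as clustering coefficients are needed.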
To evaluate our approach, we downloaded a dataset from the pushshift.io [6] website, which represents one of the main data repositories for Reddit. Specifically, we considered 449 NSFW adult subreddits listed at https://www.reddit.com/r/ListOfSubreddits/wiki/nsfw and extracted all posts and the corresponding comments published on them from January 1st, 2020 to March 31st, 2020. The number of posts in the dataset is 3,064,758, while the number of comments is 11,627,372.

3. Pattern Extraction and Enrichment

This step extracts text patterns from posts and comments in the dataset and, then, enriches them with additional information concerning their frequency and utility. Pattern mining plays a key role in this activity. It is a well-known task in the literature, which extracts from a dataset some (hopefully interesting and/or unexpected) information that can be understood by humans. Many pattern mining approaches are based on the concept of pattern frequency and aim at identifying the most frequent patterns in the texts received as input. They are based on the assumption that frequent patterns are interesting [7, 8]. This is true in many application contexts. However, there are cases where it does not hold. To handle these cases, the notion of pattern utility has been introduced. It shifts the emphasis from frequent pattern mining to High Utility Pattern Mining (hereafter, HUPM) [9, 10]. In this case, a utility function is defined; the patterns with a high value of this function are considered interesting. Recall that a utility function denotes a user preference ordering over a set of choices [9]. It is clearly a subjective measure, allowing us to state the usefulness of a text pattern from different perspectives, depending on our preferences and/or needs.

Having introduced the concepts of frequency and utility of a pattern, we can illustrate our approach to pattern extraction and the model on which it operates.
Let $\mathcal{C} = \{c_1, c_2, \cdots, c_n\}$ be a set of lemmatized comments, obtained at the end of the Data Cleaning and Annotation step. Each comment $c_i \in \mathcal{C}$ corresponds to a post and is written by a Reddit user. We can represent $c_i$ as a set of lemmas $c_i = \{l_1, l_2, \cdots, l_m\}$. Therefore, if we denote by $\mathcal{L} = \{l_1, l_2, \cdots, l_q\}$ the set of all possible lemmas, then $c_i$ is a subset of $\mathcal{L}$.

From the HUPM perspective, each lemma is an item. A pattern $P_j$ is a set of items and, therefore, $P_j \subseteq \mathcal{L}$. $P_j$ can occur in zero, one or more comments in our dataset. We denote by $\mathcal{C}_j \subseteq \mathcal{C}$ the subset of the comments of $\mathcal{C}$ in which $P_j$ is present, and we call the cardinality of $\mathcal{C}_j$ the frequency of $P_j$. $P_j$ inherits the set of features characterizing the comments of $\mathcal{C}_j$, and the utility of $P_j$ can be defined as an appropriate function of all or some of these features. The choice of the features and of the utility function to adopt determines the perspective one wishes to consider in the analysis of patterns.

For example, consider the features score_comm (denoting the score of a comment) and compound (indicating the sentiment value extracted from the text of a comment). Suppose that the utility function is Pearson's correlation [11] between them, which tells us whether the two features are correlated, i.e., whether a high (resp., low) score of a comment is accompanied by a positive (resp., negative) sentiment. This function allows us to select those patterns whose presence in comments with high (resp., low) scores is accompanied by a positive (resp., negative) sentiment. We point out that this correlation between score and sentiment is not obvious for comments, because there could exist comments with a high (resp., low) score and a null or negative (resp., positive) sentiment. In the following, we call $f_p$ the utility function computing the Pearson correlation. It is worth investigating both the patterns having a positive value of $f_p$ and those having a negative value of that function.
Indeed, a positive (resp., negative) value of $f_p$ indicates that there is a direct (resp., inverse) correlation between the sentiment aroused by a comment and the score it obtains. Consequently, we denote by $f_p^+$ (resp., $f_p^-$) the function that selects those patterns having a value of $f_p$ greater (resp., lower) than a threshold $th_p^+$ (resp., $th_p^-$).

Figure 2 shows the trend of the number of extracted patterns as $th_p^-$ (resp., $th_p^+$) decreases (resp., increases). Patterns are also grouped based on their length. This figure provides us with non-obvious and interesting knowledge. In fact, the number of patterns extracted by $f_p^-$ is much greater than the number extracted by $f_p^+$. This allows us to say that, given a pattern, a positive (resp., negative) sentiment associated with it is not necessarily accompanied by a high (resp., low) score of the comments where it is present. This phenomenon is very evident for moderately positive or negative values of $f_p$, while it weakens strongly for extreme values. It can be explained by considering that, given the nature of the terms they contain, NSFW posts and comments tend to be associated with a negative sentiment by any sentiment analysis tool. This happens even when such terms are used in goliardic comments, which are actually appreciated by this type of audience. For instance, consider the text pattern $\{hot, fuck\}$, possibly accompanied by an emoticon with two little hearts instead of eyes. Applying the sentiment analysis tool of [5] to our dataset, we obtained a sentiment value of -0.1280 for this pattern. Instead, the corresponding comments have a very high score. As a consequence, the value of $f_p$ is negative. This leads to a first important outcome of our approach, namely that traditional sentiment computation tools do not work well in the presence of NSFW posts and comments.
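To make the definitions of frequency, $f_p$, $f_p^+$ and $f_p^-$ more concrete, the following Python sketch computes them on a handful of made-up comments. It is a minimal illustration under stated assumptions, not the authors' code: the toy lemmas, scores and compound values are invented, the candidate patterns are given by hand rather than mined, and the `min_freq` parameter is a hypothetical stand-in for the frequency filter.

```python
from math import sqrt

# Hypothetical toy comments: (set of lemmas, score_comm, compound).
comments = [
    ({"hot", "damn", "nice"}, 120, -0.40),
    ({"hot", "damn"}, 60, -0.10),
    ({"hot", "damn", "wow"}, 5, 0.30),
    ({"nice", "love"}, 80, 0.70),
    ({"nice", "love", "wow"}, 10, 0.20),
    ({"nice", "love"}, 40, 0.50),
]

def pearson(xs, ys):
    """Pearson correlation; None when it is undefined (n < 2 or zero variance)."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)

def supporting(pattern, comments):
    """C_j: the comments whose lemma set contains every item of the pattern."""
    return [c for c in comments if pattern <= c[0]]

def frequency(pattern, comments):
    """Frequency of P_j: the cardinality of C_j."""
    return len(supporting(pattern, comments))

def f_p(pattern, comments):
    """Utility: correlation between score_comm and compound over C_j."""
    support = supporting(pattern, comments)
    return pearson([s for _, s, _ in support], [c for _, _, c in support])

def select_patterns(patterns, comments, th_plus, th_minus, min_freq=2):
    """Split candidates into those kept by f_p^+ and those kept by f_p^-."""
    positive, negative = [], []
    for pattern in patterns:
        if frequency(pattern, comments) < min_freq:
            continue
        value = f_p(pattern, comments)
        if value is None:
            continue
        if value > th_plus:
            positive.append(pattern)
        elif value < th_minus:
            negative.append(pattern)
    return positive, negative

if __name__ == "__main__":
    candidates = [frozenset({"hot", "damn"}), frozenset({"nice", "love"})]
    for p in candidates:
        print(sorted(p), frequency(p, comments), f_p(p, comments))
```

In this toy data, the comments containing {hot, damn} get higher scores as their compound value decreases, so the pattern has a negative $f_p$; this mirrors, on a very small scale, the situation described above for NSFW comments.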
As for the choice of the values of $th_p^+$ and $th_p^-$, all the reasoning above, and the presence of a low number of extracted patterns, led us to choose low values for the two thresholds. In particular, we set $th_p^+ = 0.1$ and $th_p^- = -0.1$.

Figure 2: Number of extracted patterns against negative (at left) and positive (at right) $f_p$ thresholds

We end this discussion by pointing out that many other utility functions could be defined on many different features concerning posts and users. This important peculiarity of our approach allows us to analyze the phenomenon of NSFW content from many different perspectives.

4. Analysis of User Interaction Networks

In this section, we formally introduce the User Interaction Network and show how information can be derived from it.

Let $\mathcal{P}_f$ be the set of patterns obtained by applying the utility function $f$, and let $\mathcal{U}_f$ be the set of users who have written at least one comment or post containing at least one of the patterns of $\mathcal{P}_f$. A User Interaction Network $\mathcal{N}^{ui}$ can be defined as:

$\mathcal{N}^{ui} = \langle N^{ui}, A^{ui} \rangle$

$N^{ui}$ is the set of nodes of $\mathcal{N}^{ui}$. There exists a node $n_i \in N^{ui}$ for each user $u_i \in \mathcal{U}_f$. $A^{ui}$ is the set of arcs of $\mathcal{N}^{ui}$. An arc $(n_i, n_j, w_{ij}) \in A^{ui}$ denotes that $u_i$ made comments on a post published by $u_j$; $w_{ij}$ measures how many times this happened. $\mathcal{N}^{ui}$ allows us to characterize the behavior of users, who interact with each other by posting and commenting on NSFW adult content.

Table 1 lists some parameters of the User Interaction Networks built exploiting the dataset outlined in Section 2, the utility functions $f_p^+$ and $f_p^-$, and the thresholds $th_p^+$ and $th_p^-$. This table confirms the information derived in Section 3. For instance, $\mathcal{N}^{ui}_{f_p^-}$ contains a larger number of nodes and arcs than $\mathcal{N}^{ui}_{f_p^+}$. Note that the cardinalities of $\mathcal{P}_{f_p^-}$ and $\mathcal{P}_{f_p^+}$ follow the same trend as the number of nodes (and users) in the User Interaction Networks, even though they represent patterns instead of users.
It is worth observing the high density coupled with a high clustering coefficient characterizing $\mathcal{N}^{ui}_{f_p^+}$. This tells us that, in this network, users tend to form very tight communities, whose structure resembles that of a clique. Instead, $\mathcal{N}^{ui}_{f_p^-}$ has a high density but a low clustering coefficient; this suggests the presence of very strong power users, i.e., users receiving comments from many other ones, who do not actually interact with each other. In both networks, the number of nodes composing the maximum connected component is very high, i.e., about 60%.

Table 1: Values of some parameters for User Interaction Networks

Parameter                                     $\mathcal{N}^{ui}_{f_p^-}$    $\mathcal{N}^{ui}_{f_p^+}$
Nodes                                         27,160                        1,452
Arcs                                          60,662                        7,925
Density                                       8.224e-05                     376.15e-05
Clustering coefficient                        0.004                         0.129
Number of connected components                10,939                        506
Size of the maximum connected component       16,030                        891
Average weight of arcs                        1.205                         1.935
Average Indegree (weight ≥ 2)                 16.634                        14.393
Average Outdegree (weight ≥ 2)                6.03                          7.473
Average Clustering coefficient (weight ≥ 2)   0.011                         0.139
Average Indegree (All)                        1.921                         1.973
Average Outdegree (All)                       1.912                         1.987
Average Clustering coefficient (All)          0.003                         0.039

The average number of times a user comments on the content of another one is slightly higher than 1, as shown by the very low average weight of arcs. In Figure 3, we investigated this aspect and found a power law distribution of the number of arcs against weights for $\mathcal{N}^{ui}_{f_p^-}$. It implies that the distribution of the number of comments of a user on the contents posted by another one follows a power law. An analogous result holds for $\mathcal{N}^{ui}_{f_p^+}$. We determined the parameters $\alpha$ and $\delta$ of these power law distributions; they are: $\alpha = 1.371$, $\delta = 0.062$ for $\mathcal{N}^{ui}_{f_p^-}$ and $\alpha = 1.507$, $\delta = 0.063$ for $\mathcal{N}^{ui}_{f_p^+}$. The results of these analyses show that there is a low number of pairs of users in which one of them comments on a post of the other at least twice (we call them interacting users in the following).
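As a small worked check of Table 1, the density values can be reproduced from the node and arc counts. The sketch below assumes the usual density of a directed graph without self-loops, i.e., arcs / (nodes × (nodes − 1)); the paper does not state the formula explicitly, but this assumption matches the reported figures. The dictionary keys are illustrative labels.

```python
def directed_density(nodes: int, arcs: int) -> float:
    """Density of a directed graph without self-loops (assumed definition)."""
    return arcs / (nodes * (nodes - 1))

# Node, arc and maximum-connected-component counts copied from Table 1.
table1 = {
    "f_p^-": {"nodes": 27160, "arcs": 60662, "max_cc": 16030},
    "f_p^+": {"nodes": 1452, "arcs": 7925, "max_cc": 891},
}

if __name__ == "__main__":
    for name, row in table1.items():
        density = directed_density(row["nodes"], row["arcs"])
        share = row["max_cc"] / row["nodes"]
        print(f"{name}: density={density:.3e}, "
              f"maximum connected component share={share:.1%}")
```

The same counts also give the share of nodes in the maximum connected component, roughly 59% and 61% for the two networks, in line with the "about 60%" stated in the text.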
Commenting on the posts of another user at least twice can be considered a minimum condition to detect non-random relationships between pairs of users. We compared the values of the average indegree, outdegree and clustering coefficient of the interacting users, on one hand, and of all the users, on the other hand. The bottom of Table 1 shows this comparison for the two considered networks. It is easy to observe that, in both networks, interacting users have indegrees and outdegrees much higher than the other users. Therefore, they can be considered power users. Also, their clustering coefficient is considerably higher than that of all users, indicating that they are able to promote the generation of communities. Therefore, it can be said that they are community leaders in the distribution of NSFW adult content in Reddit.

At this point, we found it interesting to investigate the possible existence of mutual relationships between interacting users. For this purpose, we determined the fraction of interacting users such that a user $u_i$ comments on the posts of a user $u_j$, and vice versa. We found that this fraction is low (i.e., 0.141) for $\mathcal{N}^{ui}_{f_p^-}$, whereas it is higher (i.e., 0.433) for $\mathcal{N}^{ui}_{f_p^+}$. Furthermore, although the number of nodes of $\mathcal{N}^{ui}_{f_p^+}$ is considerably lower than that of $\mathcal{N}^{ui}_{f_p^-}$, the two networks have similar average indegree and outdegree for both normal and interacting users. Moreover, $\mathcal{N}^{ui}_{f_p^+}$ has a much higher clustering coefficient and a much higher fraction of interacting users than $\mathcal{N}^{ui}_{f_p^-}$. Based on all these outcomes, we can state that, although the number of users of $\mathcal{N}^{ui}_{f_p^-}$ is much greater than that of $\mathcal{N}^{ui}_{f_p^+}$, the users of the latter are more inclined to be opinion leaders. Taking also the utility function underlying $\mathcal{N}^{ui}_{f_p^+}$ into account, we can deduce that the users of this network are particularly able to maintain a positive correlation between the sentiment of their comments and the associated scores.

Figure 3: Distribution of arcs against weights for $\mathcal{N}^{ui}_{f_p^-}$ (log-log scale)
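The notions of interacting users and of mutual relationships discussed above can be sketched as follows. This is only one possible reading, under explicit assumptions: an interacting pair is an arc with weight at least 2, and the pair is counted as mutual when a reverse arc of any weight exists. The paper does not spell out these details, and the arcs in the example are made up.

```python
def interacting_pairs(arcs, min_weight=2):
    """Ordered pairs (u_i, u_j) where u_i commented on u_j's posts at least
    `min_weight` times (the 'interacting users' of the text)."""
    return {pair for pair, weight in arcs.items() if weight >= min_weight}

def mutual_fraction(arcs, min_weight=2):
    """Fraction of interacting pairs (u_i, u_j) for which u_j also commented
    on a post of u_i. Assumption: a reverse arc of any weight counts."""
    pairs = interacting_pairs(arcs, min_weight)
    if not pairs:
        return 0.0
    mutual = sum(1 for (u_i, u_j) in pairs if (u_j, u_i) in arcs)
    return mutual / len(pairs)

# Hypothetical weighted arcs of a small User Interaction Network.
toy_arcs = {
    ("a", "b"): 3,
    ("b", "a"): 1,
    ("c", "b"): 2,
    ("d", "b"): 5,
    ("a", "c"): 1,
}

if __name__ == "__main__":
    print(sorted(interacting_pairs(toy_arcs)))
    print(mutual_fraction(toy_arcs))
```

Requiring the reverse arc to have weight at least 2 as well would be a stricter alternative; the choice changes the resulting fraction, so it should be fixed before comparing networks.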
Finally, the results obtained so far allow us to state that the users of $\mathcal{N}^{ui}_{f_p^+}$ are the most dynamic ones, as they publish posts attracting interest (since they are commented on by other users), and comment on the posts of the other users. This feature makes them particularly important, as they are not only content producers, but also dynamic participants who contribute to keeping their communities active, and act as opinion leaders for these communities. We call them proactive users.

5. Conclusion

This paper presented an approach to analyze NSFW users, comments and posts on Reddit, taking into account and exploiting the knowledge extracted by investigating text patterns. The methods and results considered in this paper can represent the basis for many new applications. In fact, posts and subreddits of other target categories (e.g., vegetarian or vegan users) could be examined by means of the same methodology. Moreover, the analysis of text patterns could represent the engine of an automatic classifier aiming at tagging posts and comments containing unsuitable content. Furthermore, our approach can be adapted to other social networks managing NSFW content less explicitly than Reddit. Finally, we plan to design an automatic tool that exploits a knowledge base built by integrating utility patterns and semantic analysis tools to automatically classify new contents and, then, suggest the most pertinent communities to which they should be directed.

References

[1] K. Tiidenberg, Boundaries and conflict in a NSFW community on tumblr: The meanings and uses of selfies, New Media & Society 18 (2016) 1563–1578. Sage Publications.
[2] J. Matias, Going dark: Social factors in collective action against platform operators in the Reddit blackout, in: Proc. of the International Conference on Human Factors in Computing Systems (ACM CHI 2016), San Jose, CA, USA, 2016, pp. 1138–1151. ACM.
[3] B. K. Narayanan, M.
Nirmala, Adult content filtering: Restricting minor audience from accessing inappropriate Internet content, Education and Information Technologies 23 (2018) 2719–2735. Springer.
[4] E. Corradini, A. Nocera, D. Ursino, L. Virgili, Investigating the phenomenon of NSFW posts in Reddit, Information Sciences 566 (2021) 140–164. Elsevier.
[5] C. Hutto, E. Gilbert, Vader: A parsimonious rule-based model for sentiment analysis of social media text, in: Proc. of the International AAAI Conference on Weblogs and Social Media (ICWSM’14), Ann Arbor, MI, USA, 2014, pp. 216–225.
[6] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The pushshift Reddit dataset, in: Proc. of the International AAAI Conference on Web and Social Media (ICWSM’20), volume 14, Atlanta, GA, USA, 2020, pp. 830–839. AAAI Press.
[7] P. Fournier-Viger, J. C.-W. Lin, B. Vo, T. Chi, J. Zhang, H. Le, A survey of itemset mining, WIREs Data Mining and Knowledge Discovery 7 (2017) e1207. doi:10.1002/widm.1207. Wiley.
[8] C. Aggarwal, M. Bhuiyan, M. A. Hasan, Frequent pattern mining algorithms: A survey, in: J. H. C. Aggarwal (Ed.), Frequent Pattern Mining, 2014, pp. 19–64. doi:10.1007/978-3-319-07821-2_2. Springer, Cham.
[9] P. Fournier-Viger, J.-W. Lin, R. Nkambou, B. Vo, V. Tseng, High-Utility Pattern Mining, 2019. Springer.
[10] W. Gan, C. Lin, P. Fournier-Viger, H. Chao, V. Tseng, P. Yu, A survey of utility-oriented pattern mining, IEEE Transactions on Knowledge and Data Engineering 33 (2021) 1306–1327. doi:10.1109/TKDE.2019.2942594. IEEE.
[11] K. Pearson, Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240–242. The Royal Society.