An investigation on Not Safe For Work adult content in Reddit

(Discussion Paper)

Francesco Cauteruccio, Enrico Corradini, Giorgio Terracina, Domenico Ursino, Luca Virgili

DEMACS, University of Calabria, Italy
DII, Polytechnic University of Marche, Italy

Pizzo Calabro (VV), Italy

2021

cauteruccio@mat.unical.it (F. Cauteruccio); e.corradini@pm.univpm.it (E. Corradini); terracina@mat.unical.it (G. Terracina); d.ursino@univpm.it (D. Ursino); l.virgili@pm.univpm.it (L. Virgili)

Reddit is one of the few social platforms that handles NSFW (Not Safe For Work) content in an explicit and well-structured way. Despite this, the issue has been largely neglected by researchers who have studied this social network. In this paper, we aim to contribute to this setting by proposing an approach to extract and analyze text patterns from NSFW content in Reddit. An important peculiarity of our approach is that patterns are extracted not only based on their frequency (as generally happens in the past literature), but also, and especially, based on one or more utility measures.

Keywords: Reddit, NSFW posts and comments, Text patterns, Pattern utility measures, Social Network Analysis

1. Introduction

[Figure 1: Workflow of our approach. Data Cleaning and Annotation: removing bot content, cleaning text, sentiment annotation, lexical annotation. Pattern Extraction and Enrichment: frequent pattern mining, filtering patterns by utility, enriching patterns with related data information. Network-based Pattern Analysis: building of pattern-based networks and user-based networks, network analysis.]

Our approach consists of three steps, namely: (i) Data Cleaning and Annotation, (ii) Pattern Extraction and Enrichment, and (iii) Network-based Pattern Analysis. Applying our approach on Reddit allowed us to make several contributions to this research scenario. They involve: (i) discovering that traditional approaches to sentiment computation are unreliable in the context of NSFW adult content; (ii) defining and finding opinion leaders in real communities sharing NSFW adult content; (iii) discovering text patterns representing the building blocks of NSFW posts and comments on Reddit; (iv) determining new virtual communities of users sharing NSFW adult content; (v) identifying opinion leaders who could influence such communities.

The rest of this paper is structured as follows: Section 2 provides a general description of our approach and the dataset used for our experiments. Section 3 illustrates the Pattern Extraction and Enrichment step. Section 4 describes the Network-based Pattern Analysis step. Finally, Section 5 presents our conclusions and possible future developments of our research efforts.

2. General overview of our approach

The general workflow of our approach is shown in Figure 1, which highlights the three steps composing it (i.e., Data Cleaning and Annotation, Pattern Extraction and Enrichment, and Network-based Pattern Analysis).

The Data Cleaning and Annotation step removes irrelevant content and standardizes text representations. It also performs lexical (e.g., part-of-speech and named entities) and sentiment annotations. The latter highlight the polarity of the sentiments expressed in the texts, represented in terms of a compound score computed by applying VADER [5]. Due to space limitations, we do not illustrate this step in detail in this paper.

The Pattern Extraction and Enrichment step extracts a set of text patterns from the posts and comments identified in the previous step; they are the basis for the next Network-based Pattern Analysis step. To this end, it first extracts frequent patterns. Then, it associates each pattern with a rich set of features regarding the posts and comments it derives from, as well as the users who published them. Afterwards, it defines some utility measures and associates the corresponding values with each pattern. Finally, it selects only those patterns with high frequency and high utility. Our approach allows the definition of different concepts and utility measures and, consequently, the selection of different sets of useful patterns based on them. This allows us to analyze the available NSFW content from very different perspectives, while adopting a uniform methodology.

The Network-based Pattern Analysis step applies the concepts and approaches of Social Network Analysis to the patterns obtained during the previous step with the goal of extracting information and knowledge from them. Specifically, it constructs and uses three social networks, namely: (i) User Interaction Network, in which a node $n_i$ represents a user $u_i$ who published at least one post or comment. An arc $(n_i, n_j, w_{ij})$ denotes that $u_i$ commented on a post of $u_j$; $w_{ij}$ indicates how many times this happened. (ii) Pattern Network, in which a node $n_i$ represents a pattern $p_i$ extracted in the previous step. An arc $(n_i, n_j, w_{ij})$ indicates that $p_i$ and $p_j$ were adopted by at least one user in common; $w_{ij}$ indicates the number of users who adopted both $p_i$ and $p_j$. (iii) User Content Network, in which a node $n_i$ represents a user $u_i$ who published at least one post or comment. An arc $(n_i, n_j, w_{ij})$ indicates that there is at least one comment posted by $u_i$ and at least one comment posted by $u_j$ containing the same pattern; $w_{ij}$ denotes the number of times this happened. Once these networks are built, this step proceeds by applying Social Network Analysis concepts and approaches to them for extracting information and knowledge on Reddit users publishing, commenting on and reading NSFW adult posts, and on the content these users exchange. Due to space limitations, in this paper we focus only on the User Interaction Network.
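As a rough illustration of how the Pattern Network and the User Content Network could be assembled, the following Python sketch starts from a hypothetical mapping from users to the identifiers of the patterns found in their comments. The function names, the toy data and the reading of the User Content Network weight as the number of shared patterns are assumptions made for the example; they are not taken from the original approach.

```python
from collections import Counter
from itertools import combinations


def build_pattern_network(user_patterns):
    """Map each pair of patterns to the number of users who adopted both."""
    weights = Counter()
    for patterns in user_patterns.values():
        # Every unordered pair of patterns adopted by the same user gains one unit.
        for p, q in combinations(sorted(patterns), 2):
            weights[(p, q)] += 1
    return dict(weights)


def build_user_content_network(user_patterns):
    """Map each pair of users to the number of patterns they share.

    Assumption: the arc weight is read here as the number of shared patterns.
    """
    weights = {}
    for u, v in combinations(sorted(user_patterns), 2):
        shared = user_patterns[u] & user_patterns[v]
        if shared:
            weights[(u, v)] = len(shared)
    return weights


# Hypothetical toy data: user -> identifiers of the patterns used in their comments.
user_patterns = {
    "alice": {"p1", "p2"},
    "bob": {"p1", "p2", "p3"},
    "carol": {"p3"},
}

pattern_network = build_pattern_network(user_patterns)
user_content_network = build_user_content_network(user_patterns)
print(pattern_network)
print(user_content_network)
```

Both networks are stored as plain dictionaries of weighted arcs, which keeps the sketch independent of any graph library.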

To perform the experiments for evaluating our approach, we downloaded a dataset from the pushshift.io [6] website, which represents one of the main data repositories for Reddit. Specifically, we considered 449 NSFW adult subreddits listed at https://www.reddit.com/r/ListOfSubreddits/wiki/nsfw and extracted all posts and the corresponding comments published on them from January 1, 2020 to March 31, 2020. The number of posts in the dataset is 3,064,758, while the number of comments is 11,627,372.

3. Pattern Extraction and Enrichment

This step extracts text patterns from posts and comments in the dataset and then enriches them with additional information concerning their frequency and utility. Pattern mining plays a key role in this activity. It is a well-known task in the literature, which extracts from a dataset some (hopefully interesting and/or unexpected) information that can be understood by humans.

Many pattern mining approaches are based on the concept of pattern frequency and aim at identifying the most frequent patterns in the texts received as input. They are based on the assumption that frequent patterns are interesting [7, 8]. This is true in many application contexts. However, there are cases where it does not hold. To handle these cases, the notion of pattern utility has been introduced. It shifts the emphasis from frequent pattern mining to High Utility Pattern Mining (hereafter, HUPM) [9, 10]. In this case, a utility function is defined; the patterns with a high value of this function are considered interesting. Recall that a utility function denotes a user preference ordering over a set of choices [9]. It is clearly a subjective measure allowing us to state the usefulness of a text pattern from different perspectives, depending on our preferences and/or needs.

After having introduced the concepts of frequency and utility of a pattern, we can illustrate our approach to pattern extraction and the model on which it operates. Let $C = \{c_1, c_2, \cdots, c_n\}$ be a set of lemmatized comments, obtained at the end of the Data Cleaning and Annotation step. Each comment $c_k \in C$ corresponds to a post and is written by a Reddit user. We can represent $c_k$ as a set of lemmas $c_k = \{l_1, l_2, \cdots, l_m\}$. Therefore, if we denote by $\mathcal{L} = \{l_1, l_2, \cdots, l_q\}$ the set of all possible lemmas, then $c_k$ is a subset of $\mathcal{L}$. From the HUPM perspective, each lemma is an item. A pattern $p$ is a set of items and, therefore, $p \subseteq \mathcal{L}$. $p$ can occur in zero, one or more comments in our dataset. We denote by $C_p \subseteq C$ the subset of the comments of $C$ in which $p$ is present, and by frequency of $p$ the cardinality of $C_p$. $p$ inherits the set of features characterizing the comments of $C_p$, and the utility of $p$ can be defined as an appropriate function of all or some of these features. The choice of the features and of the utility function to adopt determines the perspective one wishes to consider in the analysis of patterns.
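The model above can be made concrete with a small Python sketch in which each comment is a set of lemmas and a pattern is a frozenset of lemmas. The brute-force enumeration below is only an illustrative stand-in for a real frequent pattern mining algorithm; the function names, the `max_length` cap and the toy comments are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(comments, min_frequency, max_length=2):
    """Count every lemma set of size <= max_length and keep the frequent ones."""
    counts = Counter()
    for lemmas in comments:
        for k in range(1, max_length + 1):
            for combo in combinations(sorted(lemmas), k):
                counts[frozenset(combo)] += 1
    return {p: f for p, f in counts.items() if f >= min_frequency}


def supporting_comments(comments, pattern):
    """Return the indices of the comments that contain every lemma of the pattern."""
    return [i for i, lemmas in enumerate(comments) if pattern <= lemmas]


# Hypothetical lemmatized comments, each represented as a set of lemmas.
comments = [
    {"nice", "photo", "love"},
    {"love", "photo"},
    {"nice", "day"},
    {"love", "photo", "day"},
]

patterns = frequent_patterns(comments, min_frequency=2)
for pattern, frequency in sorted(patterns.items(), key=lambda item: -item[1]):
    print(sorted(pattern), frequency)
```

The frequency of a pattern is simply the number of comments returned by `supporting_comments`, which mirrors the definition of frequency as the cardinality of the set of comments containing the pattern.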

For example, consider the features score_comm (denoting the score of a comment) and compound (indicating the sentiment value extracted from the text of a comment). Suppose that the utility function is the Pearson's correlation [11] between them, which allows us to say whether there is a form of correlation between the two features, such that a high (resp., low) score of a comment arouses a positive (resp., negative) sentiment about it. This function allows us to select those patterns whose presence in comments with high (resp., low) scores is flanked by a positive (resp., negative) sentiment. We point out that this correlation between score and sentiment is not obvious for comments because there could exist comments with a high (resp., low) score and a null or negative (resp., positive) sentiment. In the following, we call $f_{\rho}$ the utility function computing the Pearson correlation. It is worth investigating both patterns having a positive value of $f_{\rho}$ and those having a negative value of that function. Indeed, a positive (resp., negative) value of $f_{\rho}$ indicates that there is a direct (resp., inverse) correlation between the sentiment aroused by a comment and the score it obtains. Consequently, we denote by $f_{\rho}^{+}$ (resp., $f_{\rho}^{-}$) the function that selects those patterns having a value of $f_{\rho}$ greater (resp., lower) than a threshold $th^{+}$ (resp., $th^{-}$).
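A minimal sketch of this utility-based selection is given below, assuming each comment is a dictionary with the fields `lemmas`, `score_comm` and `compound`. The record layout, the helper names and the toy values are hypothetical; only the use of the Pearson correlation between score and sentiment, and the two thresholds, come from the text.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation, or None when it is undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def pattern_utility(comments, pattern):
    """Correlation between score and sentiment over the comments containing the pattern."""
    rows = [c for c in comments if pattern <= c["lemmas"]]
    return pearson([c["score_comm"] for c in rows], [c["compound"] for c in rows])


def select_patterns(comments, patterns, th_plus=0.1, th_minus=-0.1):
    """Split patterns into those above th_plus and those below th_minus."""
    positive, negative = [], []
    for pattern in patterns:
        utility = pattern_utility(comments, pattern)
        if utility is None:
            continue  # too few comments or constant feature: correlation undefined
        if utility > th_plus:
            positive.append(pattern)
        elif utility < th_minus:
            negative.append(pattern)
    return positive, negative


# Hypothetical annotated comments.
comments = [
    {"lemmas": {"love", "photo"}, "score_comm": 10, "compound": 0.8},
    {"lemmas": {"love", "photo"}, "score_comm": 2, "compound": 0.1},
    {"lemmas": {"love", "day"}, "score_comm": 5, "compound": 0.4},
    {"lemmas": {"hot", "day"}, "score_comm": 20, "compound": -0.5},
    {"lemmas": {"hot"}, "score_comm": 1, "compound": 0.3},
]
candidates = [frozenset({"love"}), frozenset({"hot"})]
positive, negative = select_patterns(comments, candidates)
print(positive, negative)
```

Patterns whose correlation is undefined are skipped rather than assigned a default value; this is a design choice of the sketch, not something stated in the text.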

Figure 2 shows the trend of the number of extracted patterns as $th^{-}$ (resp., $th^{+}$) decreases (resp., increases). Patterns are also grouped based on their length. This figure provides us with non-obvious and extremely interesting knowledge. In fact, the number of patterns extracted by $f_{\rho}^{-}$ is much greater than the number extracted by $f_{\rho}^{+}$. This allows us to say that, given a pattern, a positive (resp., negative) sentiment of it is not necessarily accompanied by a high (resp., low) score of the comments where it is present. This phenomenon is very evident for moderately positive or negative values of $f_{\rho}$, while it reduces strongly for extreme values. It can be explained by considering that, given the nature of the terms used in these texts, NSFW posts and comments tend to be associated with a negative sentiment by any sentiment analysis tool. This happens even when such terms are used in goliardic comments, which are actually appreciated by this type of audience. For instance, consider the text pattern {ℎ, }, possibly accompanied by an emoticon with two little hearts instead of eyes. Applying a sentiment analysis tool available in the literature [5] to our dataset, we obtained a sentiment value of -0.1280 for this pattern. Instead, the corresponding comments have a very high score. As a consequence, the value of $f_{\rho}$ is negative. This allows us to obtain a first important outcome of our approach, namely that traditional sentiment computation tools do not work well in the presence of NSFW posts and comments. As for the choice of the values of $th^{+}$ and $th^{-}$, all the reasoning above, and the presence of a low number of extracted patterns, led us to choose low values for the two thresholds. In particular, we set $th^{+} = 0.1$ and $th^{-} = -0.1$.

We end this discussion by pointing out that many other utility functions could be defined on many different features concerning posts and users. This important peculiarity of our approach allows us to analyze the phenomenon of NSFW content from many different perspectives.

4. Analysis of User Interaction Networks

In this section, we formally introduce the User Interaction Network and show how information can be derived from it. Let $P$ be the set of patterns obtained by applying the utility function $f$, and let $U$ be the set of users who have written at least one comment or post containing at least one of the patterns of $P$. A User Interaction Network can be defined as: $\mathcal{U} = \langle N, A \rangle$. $N$ is the set of nodes of $\mathcal{U}$. There exists a node $n_i \in N$ for each user $u_i \in U$. $A$ is the set of arcs of $\mathcal{U}$. An arc $(n_i, n_j, w_{ij}) \in A$ denotes that $u_i$ made comments on a post published by $u_j$; $w_{ij}$ measures how many times this happened.
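The definition above can be sketched in Python as follows, assuming the raw data has already been reduced to a list of (commenter, post author) pairs and to the set of users who adopted at least one selected pattern. The input format, the function name and the choice of ignoring self-comments are assumptions of this example.

```python
from collections import Counter


def build_user_interaction_network(interactions, users):
    """Build (nodes, arcs) where arcs maps (commenter, author) to a weight.

    interactions: iterable of (commenter, post_author) pairs, one per comment.
    users: set of users who adopted at least one selected pattern.
    """
    arcs = Counter()
    for commenter, author in interactions:
        # Assumption: self-comments are ignored and both endpoints must be in `users`.
        if commenter in users and author in users and commenter != author:
            arcs[(commenter, author)] += 1
    return set(users), dict(arcs)


# Hypothetical toy data: "dave" is not in the user set, so his comment is skipped.
interactions = [
    ("alice", "bob"),
    ("alice", "bob"),
    ("bob", "alice"),
    ("carol", "bob"),
    ("dave", "bob"),
]
users = {"alice", "bob", "carol"}

nodes, arcs = build_user_interaction_network(interactions, users)
print(sorted(nodes))
print(arcs)
```

The weight of each arc counts how many times the first user commented on posts of the second one, so the network is directed and weighted.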

$\mathcal{U}$ allows us to characterize the behavior of users, who interact with each other by posting and commenting on NSFW adult content.

Table 1 lists some parameters of the User Interaction Networks $\mathcal{U}^{+}$ and $\mathcal{U}^{-}$, built exploiting the dataset outlined in Section 2, the utility functions $f_{\rho}^{+}$ and $f_{\rho}^{-}$, and the thresholds $th^{+}$ and $th^{-}$. This table confirms the information derived in Section 3. For instance, $\mathcal{U}^{-}$ contains a larger number of nodes and arcs than $\mathcal{U}^{+}$. Note that the cardinalities of the pattern sets $P^{-}$ and $P^{+}$ follow the same trend as the number of nodes (and users) in the User Interaction Networks, even though they represent patterns instead of users. It is worth observing the high density coupled with a high clustering coefficient characterizing $\mathcal{U}^{+}$. This tells us that, in this network, users tend to form very tight communities, whose structure resembles that of a clique. Instead, $\mathcal{U}^{-}$ has a high density but a low clustering coefficient; this suggests the presence of very strong power users, i.e., users receiving comments from many other ones, who do not actually interact with each other. In both networks, the number of nodes composing the maximum connected component is very high, i.e., about 60%.

[...] parameters $\alpha$ and $\delta$ of these power law distributions; they are: $\alpha = 1.371$, $\delta = 0.062$ for $\mathcal{U}^{-}$ and $\alpha = 1.507$, $\delta = 0.063$ for $\mathcal{U}^{+}$. The results of these analyses show that there is a low number of pairs of users in which one of them comments on a post of the other at least twice (we call them interacting users in the following). This can be considered a minimum condition to detect non-random relationships between pairs of users. We compared the values of the average indegree, outdegree and clustering coefficient of the interacting users, on the one hand, and all the users, on the other hand. The bottom of Table 1 shows this comparison for the two considered networks. It is easy to observe that, in both networks, interacting users have indegrees and outdegrees much higher than the other users. Therefore, they can be considered power users. Also, their clustering coefficient is very high, indicating that they are able to promote the generation of communities. Therefore, it can be said that they are community leaders in the distribution of NSFW adult content in Reddit.
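The measures discussed above can be computed with a few lines of Python on a network stored as a set of nodes and a dictionary of weighted arcs. This is only a sketch under stated assumptions: density is taken as the ratio between existing and possible directed arcs, the clustering coefficient is computed on the undirected version of the network, and the toy network is hypothetical. The text does not specify which variants of these measures were actually used.

```python
from itertools import combinations


def density(nodes, arcs):
    """Fraction of the possible directed arcs that are present."""
    n = len(nodes)
    return len(arcs) / (n * (n - 1)) if n > 1 else 0.0


def degrees(nodes, arcs):
    """Return (indegree, outdegree) dictionaries, ignoring arc weights."""
    indegree = {n: 0 for n in nodes}
    outdegree = {n: 0 for n in nodes}
    for source, target in arcs:
        outdegree[source] += 1
        indegree[target] += 1
    return indegree, outdegree


def average_clustering(nodes, arcs):
    """Average local clustering coefficient on the undirected version of the network."""
    neighbors = {n: set() for n in nodes}
    for source, target in arcs:
        neighbors[source].add(target)
        neighbors[target].add(source)
    total = 0.0
    for n in nodes:
        k = len(neighbors[n])
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a, b in combinations(neighbors[n], 2) if b in neighbors[a])
        total += 2 * links / (k * (k - 1))
    return total / len(nodes) if nodes else 0.0


def interacting_users(arcs, min_weight=2):
    """Users belonging to a pair where one commented on the other's posts at least min_weight times."""
    users = set()
    for (source, target), weight in arcs.items():
        if weight >= min_weight:
            users.update((source, target))
    return users


# Hypothetical toy network.
nodes = {"a", "b", "c", "d"}
arcs = {("a", "b"): 2, ("b", "c"): 1, ("c", "a"): 1, ("d", "a"): 1}

indegree, outdegree = degrees(nodes, arcs)
print(density(nodes, arcs), average_clustering(nodes, arcs))
print(interacting_users(arcs))
```

Comparing the average of these measures over `interacting_users(arcs)` with the average over all nodes reproduces, in miniature, the kind of comparison reported at the bottom of Table 1.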

At this point, we found it interesting to investigate the possible existence of mutual relationships between interacting users. For this purpose, we determined the fraction of interacting users such that a user $u_i$ comments on the posts of a user $u_j$, and vice versa. We found that this fraction is low (i.e., 0.141) for $\mathcal{U}^{-}$, whereas it is higher (i.e., 0.433) for $\mathcal{U}^{+}$. Furthermore, although the number of nodes of $\mathcal{U}^{+}$ is considerably lower than that of $\mathcal{U}^{-}$, the two networks have similar average indegree and outdegree for both normal and interacting users. Moreover, $\mathcal{U}^{+}$ has a much higher clustering coefficient and a much higher fraction of interacting users than $\mathcal{U}^{-}$. Based on all these outcomes, we can state that, although the number of users of $\mathcal{U}^{-}$ is much greater than that of $\mathcal{U}^{+}$, the latter have a stronger aptitude to be opinion leaders. Taking also the utility function underlying $\mathcal{U}^{+}$ into account, we can deduce that the users of this network are particularly able to maintain a positive correlation between the sentiment of their comments and the associated scores. Finally, the results obtained so far allow us to state that the users of $\mathcal{U}^{+}$ are the most dynamic ones, as they publish posts attracting interest (since they are commented on by other users), and comment on the posts of the other users. This feature makes them particularly important, as they are not only content producers, but also dynamic participants who contribute to keeping their communities active, and act as opinion leaders for these communities. We call them proactive users.
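One possible way to compute such a fraction is sketched below. The exact definition used for the reported values is not spelled out in the text, so this example assumes that an interacting pair is an arc with weight at least 2 and that the pair is mutual when an arc also exists in the opposite direction. The function name and the toy arcs are hypothetical.

```python
def mutual_fraction(arcs, min_weight=2):
    """Fraction of interacting pairs for which the reverse arc also exists."""
    strong = [(u, v) for (u, v), w in arcs.items() if w >= min_weight]
    if not strong:
        return 0.0
    mutual = sum(1 for u, v in strong if (v, u) in arcs)
    return mutual / len(strong)


# Hypothetical weighted arcs: (commenter, post author) -> number of comments.
arcs = {("a", "b"): 2, ("b", "a"): 1, ("c", "b"): 3, ("d", "a"): 1}

fraction = mutual_fraction(arcs)
print(fraction)
```

A stricter variant could require the reverse arc to reach the same minimum weight; which variant matches the reported values is not determined by the text.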

5. Conclusion

This paper presented an approach to analyze NSFW users, comments and posts on Reddit, taking into account and exploiting the knowledge extracted by investigating text patterns. The methods and results considered in this paper can represent the basis for many new applications. In fact, posts and subreddits of other target categories (e.g., vegetarian or vegan users) could be examined by means of the same methodology. Moreover, the analysis of text patterns could represent the engine of an automatic classifier aiming at tagging posts and comments containing unsuitable content. Furthermore, our approach can be adapted to other social networks managing NSFW content less explicitly than Reddit. Finally, we plan to design an automatic tool that exploits a knowledge base built by integrating utility patterns and semantic analysis tools to automatically classify new content and then suggest the most pertinent communities to which it should be directed.

References

[1] Tiidenberg, Boundaries and conflict in a NSFW community on tumblr: The meanings and uses of selfies, New Media & Society 18 (2016) 1563-1578. Sage Publications.

[2] Matias, Going dark: Social factors in collective action against platform operators in the Reddit blackout, in: Proc. of the International Conference on Human Factors in Computing Systems (ACM CHI 2016), San Jose, CA, USA, 2016, pp. 1138-1151. ACM.

[3] B. K. Narayanan, Nirmala, Adult content filtering: Restricting minor audience from accessing inappropriate Internet content, Education and Information Technologies 23 (2018) 2719-2735. Springer.

[4] Corradini, Nocera, Ursino, L. Virgili, Investigating the phenomenon of NSFW posts in Reddit, Information Sciences 566 (2021) 140-164. Elsevier.

[5] Hutto, E. Gilbert, Vader: A parsimonious rule-based model for sentiment analysis of social media text, in: Proc. of the International AAAI Conference on Weblogs and Social Media (ICWSM'14), Ann Arbor, MI, USA, 2014, pp. 216-225.

[6] Baumgartner, Zannettou, Keegan, Squire, J. Blackburn, The pushshift Reddit dataset, in: Proc. of the International AAAI Conference on Web and Social Media (ICWSM'20), volume 14, Atlanta, GA, USA, 2020, pp. 830-839. AAAI Press.

[7] Fournier-Viger, J. C.-W. Lin, Vo, Chi, Zhang, Le, A survey of itemset mining, WIREs Data Mining and Knowledge Discovery 7 (2017) e1207. doi:10.1002/widm.1207. Wiley.

[8] Aggarwal, Bhuiyan, M. A. Hasan, Frequent pattern mining algorithms: A survey, in: J. Han, C. Aggarwal (Eds.), Frequent Pattern Mining, 2014, pp. 19-64. doi:10.1007/978-3-319-07821-2_2. Springer, Cham.

[9] Fournier-Viger, J.-W. Lin, Nkambou, Vo, Tseng, High-Utility Pattern Mining, 2019. Springer.

[10] Gan, Lin, Fournier-Viger, Chao, Tseng, Yu, A survey of utility-oriented pattern mining, IEEE Transactions on Knowledge and Data Engineering 33 (2021) 1306-1327. doi:10.1109/TKDE.2019.2942594. IEEE.

[11] Pearson, Note on regression and inheritance in the case of two parents, Proceedings of the Royal Society of London 58 (1895) 240-242. The Royal Society.