
ConversationMoC: Encoding Conversational Dynamics using Multiplex Network for Identifying Moment of Change in Mood and Mental Health Classification⋆

Loitongbam Gyanendro Singh

Stuart E. Middleton

Tayyaba Azim

Elena Nichele

Pinyi Lyu

Santiago De Ossorno Garcia

0 Department of Management, College of Arts, Social Sciences & Humanities, University of Lincoln, UK; 1 School of Electronics and Computer Science, University of Southampton, UK; 2 Universidad Complutense de Madrid, Spain

Understanding mental health conversation dynamics is crucial, yet prior studies have often overlooked the intricate interplay of social interactions. This paper introduces a unique conversation-level dataset and investigates the impact of conversational context in detecting Moments of Change (MoC) in individual emotions and classifying Mental Health (MH) topics in discourse. In this study, we differentiate between analyzing individual posts and studying entire conversations, using sequential and graph-based models to encode the complex conversation dynamics. Further, we incorporate emotion and sentiment dynamics with social interactions using a graph multiplex model driven by Graph Convolution Networks (GCN). Comparative evaluations consistently highlight the enhanced performance of the multiplex network, especially when combining reply, emotion, and sentiment network layers. This underscores the importance of understanding the interplay between social interactions, emotional expressions, and sentiment patterns in conversations, especially within online mental health discussions. We are sharing our new dataset (ConversationMoC) and code with the broader research community to facilitate further research¹.

Keywords: Mental health conversation dynamics, Moments of Change (MoC), Emotional expressions, Graph Convolution Networks (GCN), Multiplex network

1. Introduction

has yielded superior outcomes in tasks such as question answering [13, 14] and personalized recommendation [15, 16]. Additionally, graph-based representations have proven beneficial in other conversation tasks such as dialogue act recognition [11, 17], intent detection [18], and topic modeling [19], contributing to improved performance across these domains. These findings highlight the potential of network structures to improve the understanding and performance of diverse conversation-related tasks.

Inspired by prior research, this study explores the potential of leveraging social and meta-interaction information for mental health tasks, including identifying MoC in an individual's mood and classifying MH discourse. Notably, no existing dataset specifically addresses MoC detection with a full conversation context, underscoring the novelty and importance of this study. To facilitate our investigation, we have curated a new dataset comprising 967 conversations covering 15 MH topics sourced from the Reddit social media platform (explained in Section 3.2). This dataset offers insights into the intricate interplay between language use and social interactions. Further, to encode the complex conversation dynamics, we use a multiplex network representation of conversations, wherein each layer captures a different aspect of the conversation, such as emotion, sentiment, and reply interactions (see Section 4.3 for a detailed discussion). By introducing a novel dataset and highlighting the significance of representing conversation context via multiplex networks, this study aims to uncover hidden emotional dynamics and understand the impact of social interactions on individual mood shifts. Throughout the paper, the individual who starts the conversation is referred to as the target user, and the other participants as non-target users.

A comprehensive evaluation is performed to assess the effectiveness of the proposed approach in detecting MoC and identifying types of MH topics in discourse. Using suitable sequential and graph-based baseline models, the significance of incorporating conversation context is evaluated by comparing the models' performance with and without the conversation's contextual information. Further, the significance of incorporating multiplex networks is explored by comparing the models' performance for each multiplex layer. The experimental results reveal the substantial benefits of leveraging conversational contextual information for MoC detection, offering a more accurate understanding of the target user's mood shifts, and for MH classification. Additionally, the inclusion of conversation multiplex network information, particularly the reply and sentiment graphs, significantly enhances the performance of the proposed model, as demonstrated by the results in Table 2.

In summary, this study has the following contributions:

• A new Reddit dataset, augmented with conversational context and carefully annotated for Moments of Change (MoC) and Mental Health (MH) discourse classification, is now publicly available for the first time. This dataset introduces an important development in identifying MoC using a valence and arousal space.
• This study extensively compares suitable baseline models over the new MoC dataset. Further, to encode the complex conversation dynamics, a multiplex network structure is introduced, capturing the intricate interplay between social interactions, emotional expressions, and sentiment patterns within conversations.
• A comprehensive exploration of the multiplex layers, determining the significance of each layer for conversational MoC and MH classification tasks.

The rest of the paper is organized as follows: Section 2 provides an overview of related work. Section 3 discusses the dataset curation in detail. Section 4 discusses the experiment designs. Section 5 presents the experimental results and discussion, and finally, Section 6 concludes the study.

2. Related studies

2.1. Moment of change detection

Various studies have investigated the connection between changes in user language on social media platforms and users' mental health, specifically identifying significant transitions or shifts in sentiment and/or emotion states. This work includes exploring language changes to establish a foundation for detecting MoC by analyzing sequential textual content [20, 21]. The CLPsych 2022 Shared Task [22, 5, 23] further emphasized the tasks of detecting MoC and identifying user mental health risk, where incorporating pre-trained BERT-based models into BiLSTM frameworks [6, 24] showed promising performance on a TalkLife dataset without full conversation context (i.e., target users only). These studies examined changes in the language patterns of target users to infer shifts in psychological well-being, stress levels, and emotional states, providing insights into the dynamics of mood change over time. However, they overlooked the conversations of other users with the target users.

2.2. Mental health disorder classification

Numerous studies have explored self-reported posts on social media platforms like Reddit and Twitter as valuable resources for detecting mental health (MH) disorders [1, 4, 25]. Distant supervision has emerged as a popular approach, thanks to its cost-effectiveness and ability to capture the rich expressive dynamics of MH disorders. Commonly studied disorders include schizophrenia, bipolar disorder, depression, anxiety, suicide, eating disorders, and Post-Traumatic Stress Disorder (PTSD). Previous studies have employed n-gram feature engineering within a multitask learning framework [26] to classify each MH disorder as a separate task, while others treat all disorders as a single classification task [4]. Recent approaches have leveraged fine-tuning of pre-trained BERT models [27, 3] and prompt-based masked language models [28, 29] for MH classification. However, these studies have primarily focused on classifying MH disorders based solely on the target user's posts. In contrast, this study underscores the significance of considering contextual conversation information. By incorporating this contextual information, we aim to gain a more comprehensive understanding of the conversation to accurately identify Moments of Change (MoC) and classify Mental Health (MH) disorder topics in a target user's discourse.

In a similar direction concerning mental health-related tasks, [8] highlights the significance of comprehending conversational dynamics when identifying posts indicating suicidal ideation. Their work primarily centers on determining whether a post contains suicidal ideation. In contrast, our approach tracks the temporal evolution of a target user's posts to identify MoC in the target user's mood. Furthermore, this study exploits multiplex graphs capturing various conversation aspects, such as social interactions, emotional expressions, and sentiment patterns, to provide a more nuanced understanding of the conversation dynamics. These differences highlight the distinction and depth of our contributions to conversational analysis and MH detection tasks.

3. Dataset overview

This section presents a detailed overview of the dataset used in our study, which was collected from the Reddit social media platform using the Pushshift API³. The dataset was curated to facilitate research in classifying mental health discourse and detecting temporal moments of change (MoC). For ease of reference, we name this dataset ConversationMoC. The following subsections describe the dataset's composition, the data collection process, and the attributes that make it a valuable resource for investigating mental health-related conversational dynamics.

³ https://github.com/pushshift/api

3.1. Mental Health Subreddits Selection

In this study, our data collection efforts were directed towards 15 distinct mental health (MH) subreddits⁴, each covering a wide spectrum of MH topics. The selection of these MH topics was guided by prior research, particularly a study conducted by Low et al. [30]. That study offered valuable insights into the prevalence and importance of diverse themes in online MH discussions. Although it did not address the specific task of detecting Moments of Change (MoC), it laid the groundwork for our dataset curation. Our dataset also differs in its time frame, spanning from November 1, 2018, to November 1, 2019, thus offering a distinct temporal context. By encompassing these diverse MH topics, we aimed to capture a comprehensive and representative snapshot of MH discussions within various online communities. The 15 MH topics are listed in Table 1.

Figure 2: 2D Valence-Arousal space depicting the moment of change in mood reflected through user posts (valence axis: negative to positive; arousal axis: passive to active; Anger: negative, active; Joy: positive, active; Sad: negative, passive; Optimism: positive, passive). The diagonal shift represents Switch (IS), while the horizontal or vertical shift represents Escalation (IE).

3.2. Data Collection

To compile our dataset, we collected data focusing on the posts that initiated conversations⁵. Each user's timeline constitutes a chronological record of their conversations, encompassing their posts and the replies from other users. In this study, we use the term post to refer to both user comments and the initiating posts. To ensure meaningful and comprehensive data, we selected only conversations in which the target user contributed at least two posts, allowing us to examine the conversation dynamics effectively. Table 1 presents the dataset distribution, which consists of 963 target users participating in 967 conversations: 11,841 users contributed 28,659 posts, with 9,221 posts from the 963 target users.

⁴ A subreddit is a thematic community on Reddit that focuses on specific topics.
⁵ We hypothesize that the target user is either suffering from, or interested in learning about, the subject.

Table 1: Dataset distribution per MH subreddit (Addiction, ADHD, Alcoholism, Anxiety, Autism, Bipolar, BPD, Depression, Eating Disorder, Health Anxiety, Loneliness, PTSD, Schizophrenia, Social Anxiety, Suicide), with Total and Unique rows. Recoverable columns: Convs, #Posts (#Users), Target Users, Avg posts/Convs, and the MoC label counts (IS, IE, O).

3.3. Data Annotation

Three annotators with educational backgrounds in Psychology and Computer Science were recruited to annotate the MoC in the new dataset. They were given a detailed briefing on the task, which involved determining the mood or emotion expressed in each sentence of the target user's posts. The annotators identified a dominant mood for each of the user's posts (anger, sad, joy, optimism, or neutral), which was the basis for determining MoC between consecutive posts. The task is defined as a three-class classification problem: Switch (IS), Escalation (IE), and No MoC (O), following the annotation scheme of [22, 5]. IS represents abrupt changes in an individual's emotional state, while IE signifies the evolving nature of mood changes. O indicates relative stability, i.e., no noticeable shift in the user's mood.

The Valence and Arousal (VA) chart (shown in Figure 2) is used to annotate IS and IE, representing affective states in a continuous numerical VA space. According to the Circumplex model [31], horizontal or vertical transitions in the VA space, such as moving from Anger to Sad or from Anger to Joy and vice versa, correspond to Emotional Escalation (IE). Conversely, diagonal transitions, like going from Sad to Joy or from Anger to Optimism and vice versa, indicate Emotional Switch (IS). In simpler terms, for an escalation, either the valence or the arousal level remains the same even though the emotion changes; for a switch, both the valence and arousal levels change. When the emotion remains unchanged or neutral throughout a conversation, it is labeled O. The use of VA space allows a more structured assessment of IS and IE and is less subjective than relying on simple annotator judgments of mood change as in [22, 5].

The annotators achieved near-perfect agreement, with a mean Cohen's Kappa score⁶ of 0.808 across all 15 subreddits. Conflicts in annotations were resolved through majority voting, with the final manual label determined by one annotator acting as the chairperson, who had a deeper understanding of the context and of its similarities to other shared tasks. From Table 1, it can be seen that the distribution of IE, IS, and O annotations is highly imbalanced, reflecting the real scenario where emotional switches (IS) are infrequent and escalations (IE) occur less frequently than relative stability (O). This distribution aligns with the finding that user posts commonly show stable moods.

⁶ https://en.wikipedia.org/wiki/Cohen's_kappa

4. Methodology

This study evaluates the performance of state-of-the-art sequential and graph-based models on the novel ConversationMoC dataset. Additionally, it explores the potential of leveraging social and meta-interaction information through a multiplex network structure, where each layer captures a distinct aspect of the conversation: emotion, sentiment, and reply relations. Figure 3 shows an overview of this experimental framework, demonstrating how conversation dynamics are encoded. This can be achieved using a standalone sequential model, a graph-based model, or a combination of both. The following subsections describe the evaluation framework in depth.

Figure 3: Overview of the evaluation framework. The posts of Conversation_i (P) are embedded and passed through a sequence-to-sequence learning model (temporally encoded post embedding), then through a multiplex graph learning model over the post-to-post multiplex graph (A) (conversation-encoded post embedding). For MoC classification, non-target user post embeddings are masked and a softmax is applied over the target user's posts; for MH discourse classification, a multi-head attention layer and flattening precede the softmax.

4.1. Post embedding

This study considers the concatenation of three pre-trained embeddings for the semantic representation of individual posts: averaged fastText word embeddings [32], Sentence-BERT (SBERT) [33], and task-specific pre-trained RoBERTa-base models [34]⁷. These pre-trained embedding models have been used in various studies [5, 6, 24] and demonstrated superior performance in the CLPsych 2022 shared task [22]. Several preprocessing steps were performed before computing the post embeddings, such as normalizing keywords, anonymizing users⁸, converting to lowercase, and removing URL links.

4.2. Sequential Representation

To model the sequential progression of posts within a conversation and capture temporal dependencies in the target user's moods, we use a Bidirectional Long Short-Term Memory (BiLSTM) model [35, 36] as the fundamental component of the sequential representation model. The BiLSTM layer processes the input sequence of posts encoded using the off-the-shelf pre-trained models (discussed in Section 4.1), denoted as $P = \{p_1, p_2, \dots, p_n\}$, where each $p_i$ represents an individual post. Mathematically, the BiLSTM network is defined as follows:

$$
\begin{aligned}
\overrightarrow{h}_i &= \mathrm{LSTM}_{\rightarrow}\big(e_i, \overrightarrow{h}_{i-1}, \overrightarrow{c}_{i-1}\big),\\
\overleftarrow{h}_i &= \mathrm{LSTM}_{\leftarrow}\big(e_i, \overleftarrow{h}_{i+1}, \overleftarrow{c}_{i+1}\big),\\
h_i &= \big[\overrightarrow{h}_i ; \overleftarrow{h}_i\big]
\end{aligned}
\tag{1}
$$

where $e_i$ is the semantic embedding of post $p_i$, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ represent the hidden states of the forward and backward LSTMs, $\overrightarrow{c}_{i-1}$ and $\overleftarrow{c}_{i+1}$ are the previous cell states of the forward and backward LSTMs, and $h_i$ represents the temporally enhanced post embedding, which is the concatenation of the hidden states from both the forward and backward LSTMs. The BiLSTM layer processes the input sequence step by step, updating the hidden states $h_i$ and cell states $c_i$ at each time step $i$. This allows the model to capture the sequential information in the conversation, capturing the temporal dependencies between posts and enabling a better understanding of the user's mood dynamics over time.
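To make Equation (1) concrete, the following is a minimal from-scratch NumPy sketch of a BiLSTM forward pass over a sequence of post embeddings. It assumes the standard LSTM gate equations, random (untrained) weights, and toy 8-dimensional vectors standing in for the concatenated fastText, SBERT, and RoBERTa embeddings. The names `LSTMDirection` and `bilstm` are hypothetical. A trained model would more typically use a framework layer such as PyTorch's `nn.LSTM(..., bidirectional=True)`.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class LSTMDirection:
    """One LSTM direction with randomly initialised weights (illustration only, not trained)."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        # Rows are stacked for the input, forget, candidate and output gates.
        self.w = rng.uniform(-scale, scale, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def step(self, e, h_prev, c_prev):
        z = self.w @ np.concatenate([e, h_prev]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)  # new cell state
        h = sigmoid(o) * np.tanh(c)  # new hidden state
        return h, c


def bilstm(embeddings, fwd, bwd):
    """Return H^(0), whose i-th row is h_i = [forward h_i ; backward h_i] as in Eq. (1)."""
    n = len(embeddings)
    h_fwd = np.zeros((n, fwd.hidden_dim))
    h_bwd = np.zeros((n, bwd.hidden_dim))
    h, c = np.zeros(fwd.hidden_dim), np.zeros(fwd.hidden_dim)
    for i in range(n):  # left to right, conditioned on h_{i-1}, c_{i-1}
        h, c = fwd.step(embeddings[i], h, c)
        h_fwd[i] = h
    h, c = np.zeros(bwd.hidden_dim), np.zeros(bwd.hidden_dim)
    for i in reversed(range(n)):  # right to left, conditioned on h_{i+1}, c_{i+1}
        h, c = bwd.step(embeddings[i], h, c)
        h_bwd[i] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)


rng = np.random.default_rng(0)
posts = rng.normal(size=(5, 8))  # 5 posts with toy 8-dim embeddings e_i
fwd, bwd = LSTMDirection(8, 4, rng), LSTMDirection(8, 4, rng)
H0 = bilstm(posts, fwd, bwd)
print(H0.shape)  # (5, 8)
```

You can check the wiring against Equation (1) by perturbing the last post. Because the forward direction only reads $h_{i-1}, c_{i-1}$, the forward half of the first row stays unchanged. The backward half of the last row, by contrast, changes immediately.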

⁷ https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment
⁸ Original user names were converted to @username.
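Section 4.1 names the preprocessing steps without specifying them exactly. The sketch below shows one plausible version of user anonymization (footnote 8), URL removal, and lowercasing. The regular expressions, the `participant_names` argument, and the helper name `preprocess_post` are all assumptions. Keyword normalization is omitted because the paper does not define it.

```python
import re

# Assumed patterns (not specified in the paper): Reddit mentions written as
# "u/name" or "/u/name", and URLs starting with "http(s)://" or "www.".
USER_MENTION = re.compile(r"(?<!\w)/?u/[A-Za-z0-9_-]+")
URL = re.compile(r"(?:https?://|www\.)\S+")
WHITESPACE = re.compile(r"\s+")


def preprocess_post(text, participant_names=()):
    """Hypothetical helper: anonymize users, remove URLs, lowercase, tidy whitespace."""
    # Replace explicit Reddit mentions with the placeholder token.
    text = USER_MENTION.sub("@username", text)
    # Also replace bare mentions of known conversation participants.
    for name in participant_names:
        text = re.sub(rf"(?<!\w){re.escape(name)}(?!\w)", "@username", text)
    # Drop URL links, then lowercase.
    text = URL.sub("", text)
    text = text.lower()
    # Collapse the whitespace left behind by the removals.
    return WHITESPACE.sub(" ", text).strip()


print(preprocess_post("Thanks u/Alice_99! See https://example.com/x for more"))
# thanks @username! see for more
```

Anonymization runs before lowercasing so that the participant-name matching stays case-sensitive. This is a design choice of the sketch rather than something stated in the paper.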

4.3. Multiplex Graph Representation

Social media conversations are inherently non-linear: users respond to both earlier and recent posts, potentially influencing the mood or emotion of future posts. Figure 4 shows a conversation's multiplex network structure representation using a two-layer Graph Convolutional Network (GCN). This approach captures the non-linearity by introducing a multiplex network consisting of reply, sentiment, and emotion network layers. Specifically, the reply layer links posts involved in social interactions between users. The emotion and sentiment layers are constructed by linking posts with similar emotions and sentiments, classified using the pre-trained RoBERTa-based emotion and sentiment models [34]. The GCN model encodes the dependencies between each layer and the social and meta-interaction across various aspects of the conversation.

Figure 4: Multiplex network encoding of a conversation. The temporally encoded post embeddings H(0) of the conversation input posts (P, in temporal order) are propagated over the reply, emotion, and sentiment layers of the multiplex network (A) by 2-layer GCNs with shared weights, and max pooling yields the conversation-encoded post embedding H.

Let $A_R$, $A_E$, and $A_S$ represent the adjacency matrices of the reply, emotion, and sentiment layers, including self-loops. Mathematically, the GCN propagation at layer $l$ over a multiplex network of $K$ layers $A_k \in \{A_R, A_E, A_S\}$ can be defined as follows:

$$
H^{(l)} = \max\Big(\Big\{\sigma\big(D_k^{-1/2} A_k D_k^{-1/2} H^{(l-1)} W^{(l)}\big)\Big\}_{k=1}^{K}\Big)
\tag{2}
$$

where each row of the matrix $H^{(l-1)}$ is an input post embedding at GCN layer $l$, $\sigma$ denotes the Rectified Linear Unit (ReLU) activation function, and $D_k$ is the degree matrix of the nodes in the $k$-th multiplex layer. $W^{(l)}$ is the weight matrix at layer $l$, which is learned during training and shared across all multiplex layers. By updating the shared weight matrix $W^{(l)}$ during training, the GCN model assigns different importance to different layers of the multiplex network. Further, applying max pooling across the multiplex layers allows the network to capture the most prominent information from each layer, potentially emphasizing important features contributing to the overall task. The resulting node feature matrix $H^{(l)}$ represents the enhanced post embedding of the $l$-layer GCN model. In this study, we consider a 2-layer GCN model, where the input $H^{(0)}$ is the temporally enhanced post embedding output by the BiLSTM network and the output $H^{(2)}$ is the final enhanced post embedding ($H$), capturing both the temporal information and the multiplex network of social and meta-interactions in the conversation.

4.4. Multitask classification

The evaluation framework tackles two tasks simultaneously: Moment of Change (MoC) detection and Mental Health (MH) classification. MoC detection focuses on identifying mood shifts of the target user at the post level, while MH classification operates at the conversation level to determine the specific MH topics in discourse. To improve the MH classification task, we add a multi-head self-attention layer [37] over the enhanced post embedding $H$, resulting in an attention-weighted encoded representation $\mathcal{A}(H)$. Mathematically, the classification tasks can be defined as:

$$
\hat{y}^{\mathrm{MoC}} = \mathrm{softmax}(\mathbf{b} \ast H), \qquad \hat{y}^{\mathrm{MH}} = \mathrm{softmax}\big(\mathcal{A}(H)\big)
\tag{3}
$$

where $\mathbf{b}$ is a Boolean vector that masks the non-target users' posts in $H$.

4.5. Loss functions

The evaluation framework considers entire conversations; therefore, to classify the Moments of Change (MoC) of the target user's mood, it is essential to mask the posts of non-target users. To train the model for the MoC detection task, we apply the Focal Loss function [38], originally designed for object detection, to address the imbalanced class distribution. We use the traditional categorical cross-entropy (CE) loss function for the MH classification task. The loss functions for each task can be defined as:

$$
\mathcal{L}_{\mathrm{MoC}} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{m} \alpha_j \big(1-\hat{y}^{\mathrm{MoC}}_{ij}\big)^{\gamma}\, T^{(1)}_{ij} \log\big(\hat{y}^{\mathrm{MoC}}_{ij}\big), \qquad
\mathcal{L}_{\mathrm{MH}} = -\sum_{j=1}^{C} T^{(2)}_{j} \log\big(\hat{y}^{\mathrm{MH}}_{j}\big)
\tag{4}
$$

where $n$ and $m$ represent the number of posts and MoC class labels in the conversation, $\alpha_j$ represents the weight factor for MoC class $j$, and $\gamma$ is the focusing parameter that controls the rate at which the loss decreases for well-classified examples. For the MoC classification task, $T^{(1)}_{i}$ represents the one-hot vector of the true MoC label of target user post $i$ in the conversation, while $T^{(2)}$ represents the true label of the conversation's MH topic among the $C$ MH topics.

4.6. Comparison of model variants

The MoC and MH classification tasks can be evaluated in single-task or multitask setups. Moreover, in both setups the conversation dynamics can be encoded using a standalone BiLSTM model, a GCN model, or a combination of both (BiLSTM+GCN). To assess the impact of conversation context, we compare two input scenarios: (i) TU, which encompasses solely the target user's sequence of posts, and (ii) All, which encompasses the sequence of posts interacting with the target user's posts in the conversation. Based on the input type, we evaluate the considered models (BiLSTM, GCN, BiLSTM+GCN) over the MoC dataset using the pre-trained post embeddings (discussed in Section 4.1). Hyperparameter details are in Appendix Section A.2.

4.6.1. Heuristic model for MoC detection

Inspired by the Circumplex model [31], we design a heuristic method for detecting Moments of Change (MoC) in the target user's posts. We employ a pre-trained RoBERTa emotion classifier [34] to classify the target user's posts. This model predicts four primary emotion classes (anger, sad, joy, and optimism) and assigns each class a confidence score ($t$). If a post does not meet the minimum confidence threshold ($t \geq 0.7$) for any of the four emotions, it is labeled as neutral. Further, using the Valence-Arousal (VA) space, we heuristically assign the MoC labels to the target user's posts. This method serves as the baseline model for evaluating the performance of the evaluation framework.

5. Results and discussion

5.1. Detection of Moment of change

This section evaluates the performance of the considered baseline models on the ConversationMoC dataset. Initially, we evaluate these models using two input scenarios: (i) using only the target user's posts (TU) and (ii) using the entire conversation (All). Because the TU input lacks social interactions, models like GCN and BiLSTM+GCN are not evaluated in this scenario. Further, we conduct an extensive analysis to understand the impact of different layers within the multiplex network on the downstream tasks. The experimental results for the MoC detection task, obtained through 10-fold cross-validation, are presented in Table 2. This table includes the mean F1-scores for each class (IE, IS, O) as well as the macro F1-score, providing a comprehensive view of the overall performance. From the table, it is observed that the BiLSTM+GCN model consistently outperforms its standalone counterparts. In particular, the BiLSTM+GCN models that incorporate the multiplex graph input with the Emotion, Sentiment, and Reply (ESR) layers exhibit the highest macro F1-scores, achieving 0.422 in the single-task setup and 0.438 in the multitask setup. Notably, the performance of specific models deviates significantly from the average in a few folds, leading to a standard deviation of approximately ±0.02. For a detailed view of these results, refer to the boxplot in Appendix Figure 6, which visualizes the F1-score performances of the multitask models across all folds. These findings underscore the effectiveness and consistency of the proposed framework in detecting MoC in mental health-related conversations.

The results are clear: incorporating the entire conversation context notably improves the performance of MoC detection models compared to using only the target user's posts. Furthermore, the multitask setup consistently outperforms the single-task setup⁹. The performance across the individual classes reveals further insights. The GCN model in the multitask setup emerges as the best model for the escalation (IE) class, achieving an F1-score of 0.287. For the switch (IS) class, the BiLSTM+GCN model achieves the best performance, with an F1-score of 0.169. The single-task BiLSTM model, which relies exclusively on the posts of the target user, achieves the highest F1-score of 0.906 for the No MoC (O) class. This suggests that the posts from the target user alone contain more informative signals for the O class than the context provided by the conversation. The heuristic MoC classification model also achieves an F1-score of 0.164 for the IS class, higher than any single-task model. This underscores the effectiveness of the pre-trained RoBERTa-based emotion classifier.

⁹ A boxplot comparison of the models' F1-score performances using the categorical cross-entropy and Focal loss functions is shown in Appendix Figure 6.

5.1.1. Graph multiplex layers analysis

To examine the impact of the different layers within the multiplex network more closely, we conducted a comprehensive analysis of the BiLSTM+GCN model, as detailed in Table 2. The results reveal that the model performs better when leveraging the multiplex networks than when relying on individual networks. Notably, across the individual graphs, the BiLSTM+GCN model with the Reply graph consistently outperforms the Emotion and Sentiment graphs. This suggests that social interactions provide more useful information for the tasks we are interested in. In particular, the Reply graph contains authentic, ground-truth data on social interactions. In contrast, the Emotion and Sentiment graphs are constructed from the emotion and sentiment classification of each post by the pre-trained RoBERTa classifier, which is susceptible to misclassifications, as evidenced by the performance of the heuristic MoC classification model on the IE and O classes. Moreover, when the Reply and Sentiment networks are combined, the model's performance improves even further, achieving the highest F1-score of 0.470. This indicates that the Reply network is effective in capturing changes in the target user's mood, and that the interplay between users and the presence of emotionally charged (sentimental) conversations significantly impacts MoC detection. By incorporating these additional layers, the model attains a more comprehensive understanding of the conversation dynamics, resulting in enhanced performance. In summary, the results in Table 2 highlight the importance of using multiplex networks and emphasize the pivotal role played by the Reply network in MoC detection. Combining social interactions, emotional expressions, and sentiment patterns provides a complete view of the conversation, allowing the model to handle the tasks effectively.

5.2. Mental health classification performance

Figure 5 presents a bar chart illustrating the performance of various models in classifying mental health (MH) discourse. Rather than relying on traditional topic modeling techniques, we directly categorize the MH topics discussed within the conversations using the models considered in this study. The evaluation includes single-task and multitask setups, using the categorical cross-entropy loss function to train the MH classification task. As seen in Figure 5, the multitask models perform notably better than their single-task counterparts. The strongest multitask models are BiLSTM (All) and BiLSTM+GCN (R), achieving macro F1-scores of 0.85 and 0.84, respectively. These results substantiate that incorporating conversation contextual information significantly enhances the accuracy of MH classification compared with using only the target user's posts as input. This observation highlights the substantial contribution of conversation context information to the classification of mental health discourse.

Figure 5: Performance of the models in classifying MH discourse: (a) single-task, (b) multitask.

Across individual MH topics, the BiLSTM model with All posts excels in 8 MH classes, while the BiLSTM+GCN (Reply) model leads in 7 MH classes (results detailed in Appendix Table 5). These results underscore the importance of conversation context information, with both models demonstrating robust performance across various MH topics. In summary, these findings emphasize the advantages of multitask models and show that integrating conversation context along with the Reply network significantly enhances the accuracy of MH classification within conversations. The BiLSTM (All) model emerges as a standout performer, leading in eight MH categories in this study.

6. Conclusion

This study introduces a novel, publicly accessible dataset (ConversationMoC) tailored to identifying Moments of Change (MoC) and classifying Mental Health (MH) discourse within conversational settings. The importance of incorporating conversation information for both tasks is investigated using combinations of BiLSTM and GCN models in single-task and multitask setups. The experimental results clearly show the value of incorporating conversation information to identify MoC and classify MH discourse. Further, encoding the intricate social interactions, emotional dynamics, and sentiment patterns through a multiplex network structure enhances classification performance. More specifically, the results for the Reply network emphasize the significance of social interactions and user engagement. Additionally, when the Reply network is combined with the Sentiment and Emotion networks, classification performance improves further, underscoring the influence of emotional conversations and overall sentiment. Multiplex networks represent an exciting new direction for future conversational analysis and mental health detection research.

7. Ethical Statement

Ethical approval for this study was obtained from the University of Southampton ethics board (submission reference ERGO/FEPS/64959.A1). The research involves the analysis of personal data sourced from the social media platform Reddit. To ensure compliance with ethical guidelines and regulations, we have adhered to the Reddit platform API's terms and conditions, and our annotated dataset is shared with Reddit IDs only, so other researchers can download the original Reddit posts and metadata directly from Reddit. During the annotation process, the annotators were informed about the potential risks of encountering disturbing content. They were encouraged to take regular breaks and time-outs from their annotation work to mitigate emotional overload.

Additionally, a clinically trained psychologist actively advised the team, providing expertise and guidance throughout the project. A comprehensive risk assessment was conducted to identify and address any potential risks associated with this task. Our attention to ethical considerations and to the well-being of the annotators reflects our commitment to conducting responsible and sensitive research in mental health analysis.

8. Limitations

This study has a few limitations that warrant consideration. First, our findings are derived from a single Reddit dataset. While we envision the potential for our models to generalize well to analogous conversational datasets with a similar social context graph, we have yet to conduct experiments on datasets beyond Reddit. This limitation arises from the unavailability of publicly annotated MoC datasets in this specific domain, underscoring the significance of our contribution in providing a new publicly accessible MoC dataset, ConversationMoC, for prospective research. Additionally, our study does not explore the performance of more recent large language models (LLMs) like OpenAI's GPT-3/4, Meta's LLaMA, Stanford's Alpaca, and Berkeley's Gorilla models.
While we anticipate that these advanced models could improve performance, this hypothesis remains to be validated experimentally. Furthermore, from the perspective of the evaluation framework, we highlight the following limitations and potential ways to mitigate them:

• Contextual Understanding in Short Conversations: Short conversations with few posts may offer limited context. Integrating LLMs could alleviate this issue by capturing a broader context.
• Semantic Consistency in Dynamic Conversations: Long conversations (e.g., 5 posts + 50 replies) can contain rapid emotional shifts that make it difficult to maintain semantic consistency. In this scenario, an additional attention layer could dynamically weight the influence of different posts and replies within a conversation. We also suggest exploring guiding loss functions that steer the model toward the primary conversation topics and emotions, even amid swift emotional changes. Combined, these approaches could strengthen the model's grasp of key conversation topics, particularly in conversations full of changing emotions and dynamics.
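The attention layer suggested in the second point could, for example, score each post or reply against a query vector and pool the conversation with the resulting softmax weights. The sketch below uses scaled dot-product scoring in the spirit of [37]; it is an illustration under assumptions, not part of the evaluated framework. The function name `attention_pool`, the query vector, and the toy embeddings are hypothetical, and in a trained model the query would be learned jointly with the encoder.

```python
import math

def attention_pool(posts, query):
    """Scaled dot-product attention pooling over a conversation's post vectors.

    Returns the pooled conversation vector and the per-post attention weights.
    """
    dim = len(query)
    scores = [sum(p * q for p, q in zip(post, query)) / math.sqrt(dim) for post in posts]
    m = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * post[d] for w, post in zip(weights, posts)) for d in range(dim)]
    return pooled, weights

# Toy conversation: an initial post followed by three replies (2-d embeddings).
posts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
query = [2.0, -1.0]  # hypothetical; would be a learned parameter in practice
pooled, weights = attention_pool(posts, query)
```

Because the weights depend on each post's content, posts aligned with the query dominate the pooled vector, rather than every reply contributing equally as it would under mean pooling.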

9. Future work

Although the conversation multiplex network encoding framework could apply to various domains, and testing it on diverse datasets is important, our current investigation was limited by the scarcity of datasets with similar characteristics. In the future, we aim to extend our analysis to a broader range of conversation datasets, with the goal of demonstrating the applicability of our framework beyond this specific domain.

10. Acknowledgement

This work was supported by the Natural Environment Research Council (NE/S015604/1), the Economic and Social Research Council (ES/V011278/1) and the Engineering and Physical Sciences Research Council (EP/V00784X/1).

The authors acknowledge the use of the IRIDIS High Performance Computing Facility, and associated support services at the University of Southampton, in the completion of this work, as well as the highly valuable insights into the mental health domain from Aynsley Bernard of Kooth Plc.

References

[1] Zhang, A. M. Schoene, Ji, Ananiadou, Natural language processing applied to mental illness detection: a narrative review, NPJ digital medicine 5 (2022) 46.
[2] Naskar, S. R. Singh, Kumar, Nandi, E. O. d. l. Rivaherrera, Emotion dynamics of public opinions on twitter, ACM Transactions on Information Systems (TOIS) 38 (2020) 1–24.
[3] Z. P. Jiang, S. I. Levitan, Zomick, Hirschberg, Detection of mental health from reddit via deep contextualized representations, in: Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020, pp. 147–156.
[4] Cohan, Desmet, Yates, Soldaini, S. MacAvaney, N. Goharian, Smhd: a large-scale resource for exploring online language usage for multiple mental health conditions, in: Proceedings of the 27th International Conference on Computational Linguistics, 2018, pp. 1485–1497.
[5] A. Tsakalidis, F. Nanni, A. Hills, J. Chim, Song, M. Liakata, Identifying moments of change from longitudinal user text, in: Annual Meeting of the Association for Computational Linguistics, 2022.
[6] T. Azim, L. G. Singh, S. E. Middleton, Detecting moments of change and suicidal risks in longitudinal user texts using multi-task learning, in: Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, 2022, pp. 213–218.
[7] Ghosal, Majumder, Poria, Chhaya, Gelbukh, Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 154–164.
[8] Sawhney, Agarwal, A. T. Neerkaje, Aletras, Nakov, Flek, Towards suicide ideation detection through online conversational context, in: Proceedings of the 45th international ACM SIGIR conference on research and development in information retrieval, 2022, pp. 1716–1727.
[9] L. G. Singh, S. R. Singh, Sentiment analysis of tweets using text and graph multi-views learning, Knowledge and Information Systems (2024). doi:10.1007/s10115-023-02053-8.
[10] L. G. Singh, A. Mitra, S. R. Singh, Sentiment analysis of tweets using heterogeneous multi-layer network representation and embedding, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 8932–8946.
[11] L. Qin, Z. Li, W. Che, M. Ni, T. Liu, Co-gat: A co-interactive graph attention network for joint dialog act recognition and sentiment classification, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2021, pp. 13709–13717.
[12] D. Sheng, D. Wang, Y. Shen, H. Zheng, H. Liu, Summarize before aggregate: A global-to-local heterogeneous graph inference network for conversational emotion recognition, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4153–4163.
[13] X. Huang, J. Zhang, D. Li, P. Li, Knowledge graph embedding based question answering, in: Proceedings of the twelfth ACM international conference on web search and data mining, 2019, pp. 105–113.
[14] Y. Zhang, H. Dai, Z. Kozareva, A. Smola, L. Song, Variational reasoning for question answering with knowledge graph, in: Proceedings of the AAAI conference on artificial intelligence, 2018.
[15] C. Gao, W. Lei, X. He, M. de Rijke, T.-S. Chua, Advances and challenges in conversational recommender systems: A survey, AI Open 2 (2021) 100–126.
[16] Z. Fu, Y. Xian, Y. Zhu, S. Xu, Z. Li, G. De Melo, Y. Zhang, Hoops: Human-in-the-loop graph reasoning for conversational recommendation, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2415–2421.
[17] D. Wang, Z. Li, H. Zheng, Y. Shen, Integrating user history into heterogeneous graph for dialogue act recognition, in: Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4211–4221.
[18] H. Xu, Z. Yuan, K. Zhao, Y. Xu, J. Zou, K. Gao, Garnet: A graph attention reasoning network for conversation understanding, Knowledge-Based Systems 240 (2022) 108055.
[19] L. Yang, F. Wu, J. Gu, C. Wang, X. Cao, D. Jin, Y. Guo, Graph attention topic modeling network, in: Proceedings of The Web Conference 2020, 2020, pp. 144–154.
[20] M. De Choudhury, E. Kiciman, M. Dredze, G. Coppersmith, M. Kumar, Discovering shifts to suicidal ideation from mental health content in social media, in: Proceedings of the 2016 CHI conference on human factors in computing systems, 2016, pp. 2098–2110.
[21] Y. Pruksachatkun, S. R. Pendse, A. Sharma, Moments of change: Analyzing peer-based cognitive support in online mental health forums, in: Proceedings of the 2019 CHI conference on human factors in computing systems, 2019, pp. 1–13.
[22] A. Tsakalidis, J. Chim, I. M. Bilal, A. Zirikly, D. Atzil-Slonim, F. Nanni, P. Resnik, M. Gaur, K. Roy, B. Inkster, et al., Overview of the clpsych 2022 shared task: Capturing moments of change in longitudinal user posts, in: Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, 2022, pp. 184–198.
[23] A. Hills, A. Tsakalidis, F. Nanni, I. Zachos, M. Liakata, Creation and evaluation of timelines for longitudinal user posts, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 3773–3786.
[24] U. Bayram, L. Benhiba, Emotionally-informed models for detecting moments of change and suicide risk levels in longitudinal social media data, in: Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, 2022, pp. 219–225.
[25] G. Coppersmith, M. Dredze, C. Harman, K. Hollingshead, From adhd to sad: Analyzing the language of mental health on twitter through self-reported diagnoses, in: Proceedings of the 2nd workshop on computational linguistics and clinical psychology: from linguistic signal to clinical reality, 2015, pp. 1–10.
[26] A. Benton, M. Mitchell, D. Hovy, Multitask learning for mental health conditions with limited social media data, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 2017, pp. 152–162. URL: https://aclanthology.org/E17-1015.
[27] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, E. Cambria, Mentalbert: Publicly available pretrained language models for mental healthcare, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 7184–7190.
[28] S. Ji, Towards intention understanding in suicidal risk assessment with natural language processing, in: Findings of the Association for Computational Linguistics: EMNLP 2022, 2022, pp. 4028–4038.
[29] I. Lin, L. Njoo, A. Field, A. Sharma, K. Reinecke, T. Althoff, Y. Tsvetkov, Gendered mental health stigma in masked language models, in: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2022, pp. 2152–2170. URL: https://aclanthology.org/2022.emnlp-main.139.
[30] D. M. Low, L. Rumker, T. Talkar, J. Torous, G. Cecchi, S. S. Ghosh, Natural language processing reveals vulnerable mental health support groups and heightened health anxiety on reddit during covid-19: Observational study, Journal of medical Internet research 22 (2020) e22635.
[31] J. A. Russell, A circumplex model of affect., Journal of personality and social psychology 39 (1980) 1161.
[32] Bojanowski, Grave, Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the association for computational linguistics 5 (2017) 135–146.
[33] Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
[34] Barbieri, Camacho-Collados, L. E. Anke, L. Neves, Tweeteval: Unified benchmark and comparative evaluation for tweet classification, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1644–1650.
[35] Zhang, Zheng, Hu, M. Yang, Bidirectional long short-term memory networks for relation classification, in: Proceedings of the 29th Pacific Asia conference on language, information and computation, 2015, pp. 73–78.
[36] Kawakami, Supervised sequence labelling with recurrent neural networks, Ph.D. thesis (2008).
[37] Vaswani, Shazeer, Parmar, Uszkoreit, Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[38] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
[39] D. P. Kingma, Ba, Adam: A method for stochastic optimization, in: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.

A. Appendix

A.1. 15 subreddit topics

In this study, we collected data from 15 mental health subreddits covering a wide range of topics: Eating Disorder (r/EDAnonymous), Addiction (r/addiction), Alcoholism (r/alcoholism), Attention Deficit Hyperactivity Disorder (ADHD) (r/adhd), Anxiety (r/anxiety), Autism (r/autism), Bipolar Disorder (r/BipolarReddit), Borderline Personality Disorder (BPD) (r/bpd), Depression (r/depression), Health Anxiety (r/healthanxiety), Loneliness (r/lonely), Post-Traumatic Stress Disorder (PTSD) (r/ptsd), Schizophrenia (r/schizophrenia), Social Anxiety (r/socialanxiety), and Suicide (r/SuicideWatch). By covering these diverse mental health topics, we aimed to capture a comprehensive picture of mental health discussions in online communities.

A.2. Hyperparameters

This study considers several hyperparameters to optimize the performance of the proposed model for detecting moments of change and identifying mental health topics in conversations. The detailed hyperparameter settings, including the dimensions of the output representations from the pretrained models, are presented in Table 3.

Table 3: Hyperparameter settings.
Hyperparameters: Optimizer; Learning rate; Training epochs; Batch size; BiLSTM #Units; Multihead attention layers.
Pretrained models: FastText [32]; Sentence-BERT [33]; *RoBERTa-base (emoji); *RoBERTa-base (emotion) [34]; *RoBERTa-base (hate); *RoBERTa-base (irony); *RoBERTa-base (offensive); *RoBERTa-base (sentiment).

A.3. Moment of change classification

Figure 6 presents boxplots of the F1-scores for moment of change (MoC) classification across three classes, IE (escalation), IS (switch), and O (no MoC), together with the macro F1-score. Each boxplot corresponds to a different model considered in this study, with the models on the x-axis and the F1-scores on the y-axis. The boxplots show the median (middle line), the interquartile range (box), and the range of the scores (whiskers), providing a visual summary of the performance distribution for MoC classification. The models and their input types are:

Abbr. | Model | Input type
H | Heuristic classifier | Target user's posts only
TU | BiLSTM (TU) | Target user's posts only
All | BiLSTM (All) | Entire posts in a conversation
E | BiLSTM+GCN (E) | Entire posts + Emotion graph
S | BiLSTM+GCN (S) | Entire posts + Sentiment graph
R | BiLSTM+GCN (R) | Entire posts + Reply graph
ES | BiLSTM+GCN (ES) | Entire posts + Emotion and Sentiment multiplex graph
ER | BiLSTM+GCN (ER) | Entire posts + Emotion and Reply multiplex graph
SR | BiLSTM+GCN (SR) | Entire posts + Sentiment and Reply multiplex graph
ESR | BiLSTM+GCN (ESR) | Entire posts + Emotion, Sentiment, and Reply multiplex graph

[Figure 6: Boxplots of MoC classification F1-scores (IE, IS, O, and macro F1) per model, e.g. H, All, ESR.]

A.4. Mental Health classification

… performing multitask models for each of the 15 individual mental health categories. The table showcases the effectiveness of these models in classifying mental health categories, as indicated by the high F1-scores achieved through 10-fold cross-validation.

Table 5: Mental Health (MH) classification task performance (F1-score). Bold indicates top-performing models across individual MH categories and Macro-F1 scores. Mean results for 10-fold cross-validation are reported with standard deviations.
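Appendices A.3 and A.4 summarise results as per-class F1, macro F1 over the MoC classes IE, IS, and O, and mean ± standard deviation over 10 folds, visualised as boxplots. A minimal pure-Python sketch of these statistics follows, under stated assumptions: the function names and all inputs are illustrative toy values, not data from the study; per-class F1 is taken as 0.0 when a class has no true positives; and quartiles use linear interpolation (`statistics.quantiles` with `method="inclusive"`), which is one common boxplot convention.

```python
import statistics

MOC_LABELS = ("IE", "IS", "O")  # escalation, switch, no MoC

def f1_for_label(y_true, y_pred, label):
    """F1 for one class; defined as 0.0 when the class has no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels=MOC_LABELS):
    """Unweighted mean of the per-class F1-scores."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)

def summarise_folds(fold_scores):
    """Mean, sample standard deviation, and boxplot quartiles of per-fold scores."""
    q1, median, q3 = statistics.quantiles(fold_scores, n=4, method="inclusive")
    return {
        "mean": statistics.mean(fold_scores),
        "std": statistics.stdev(fold_scores),
        "q1": q1,
        "median": median,
        "q3": q3,
    }

# Toy predictions for one fold, and toy per-fold scores (illustrative only).
fold_f1 = macro_f1(["IE", "IS", "O", "O"], ["IE", "O", "O", "IS"])
summary = summarise_folds([0.1, 0.2, 0.3, 0.4, 0.5])
```

In the toy fold, IE is predicted perfectly (F1 = 1.0), IS is never predicted correctly (F1 = 0.0), and O gets F1 = 0.5, so the macro F1 is 0.5: macro averaging weights rare classes such as IE and IS as heavily as the majority class O.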