Introduction

Modeling Social Media Narratives about Caste-related News Stories

Prashanth Vijayaraghavan

pralav@media.mit.edu 1

Lavanya Vijayaraghavan

11ya.vijayaraghavan@gmail.com 0

0 Fremont, CA 08544, USA
1 MIT Media Lab, Cambridge, MA 02139, USA

Caste, as a system of social stratification, has created a structured culture of oppression in the Indian subcontinent. The rise of social media has democratized public voice, giving all groups space to express, discuss, debate, and form opinions on critical issues. Unfortunately, it also offers a platform for closet or overtly casteist persons to perpetuate discrimination, spread hatred, and sustain casteism under the veil of creating good social media narratives. The overall goal of our work is to model social media narratives associated with caste-specific news stories. To this end, we first aggregate user-generated social media content (e.g., comments) about various caste-related news stories. Next, we analyze this aggregated content to extract divergent value judgments representing different opinions associated with these news stories. Finally, our ongoing research will provide means to automatically infer the value judgments for each piece of user-generated content, track divergent narratives related to a particular news story, and tackle casteist social media posts using counter-narratives generated by leveraging the inferred value judgments.

Introduction

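[Figure: Data processing pipeline for CENSOR (Caste-based Narrative Corpus). Stages: Noisy Caste-related Text Collection (keyword-based filtering); Caste-related Story Classifier (BERT-based classifier); Social Media Narrative Collection from Reddit (deduplication, news-comments mapping); Value Judgment Extraction (FrameNet lexical units).]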

Prior research has primarily focused on studies of endogamy and the associated qualitative interviews and surveys [RMJ19], hate speech detection on social media [KJ18], and analysis of caste in newsrooms [FBLM19], to list a few. However, there is a vast gap between the prevalence of casteism in day-to-day life and the NLP effort devoted to understanding and tackling it on social media. The increasing number of crimes against disadvantaged caste groups is generally reflected in news reports, despite the bias in newsrooms [FBLM19]. Moreover, several studies have indicated tremendous growth in online news consumption, especially through social media. However, [WFE+16] observed that exposure to one-sided social media comments influenced participants' opinions on an issue, mainly when the comments contained personal stories. Therefore, we deem it necessary to identify the divergent narratives emerging from caste-related news stories and to tackle casteist discourse with effective counter-narratives.

In our work, we construct a corpus, Censor¹, using a data processing pipeline that (a) aggregates caste-related news stories from mainstream news sources, (b) maps these news stories to social media comments (Reddit, in our case), and (c) extracts the value judgments representing different user opinions about each news story. Finally, we will apply different learning strategies to infer value judgments and effectively generate counter-narratives to user-generated comments. Our contribution is three-fold:

• An automated data collection pipeline for aggregating caste-related stories.
• A cross-platform corpus, Censor, that contains characteristic narratives aggregated by extracting value judgments from Reddit content about caste-related stories.
• An ongoing modeling approach to infer value judgments and generate counter-narratives for user-generated comments.

Dataset Collection

• Noisy Caste-related Text Collection: Among traditional news media, we look specifically for English-language digital news sources and build a keyword-based search for caste. Keywords include broad caste (or varna) names and officially recognized names used by the media, as explained in Section 1. In this work, we focus on English-language news stories from media collections in India (at the national, state, and local levels, as indicated in Media Cloud). Since caste atrocities increased during the lockdown period related to the COVID-19 pandemic⁴, we collect news data over a two-year window, i.e., between 2019 and 2021, to analyze this pattern and characterize the nature of social media narratives during the same period.

Given that Media Cloud permits boolean searches, we construct a general boolean query containing keywords related to: (a) general mentions of castes or their categories (e.g., castes, Brahmins, Kshatriyas, Vaishyas, Shudras, Dalits, upper castes, lower castes), (b) government-recognized group names and their abbreviations (e.g., Scheduled Castes, Scheduled Tribes, SC/ST, Other Backward Classes, OBCs), (c) common references to discriminative practices and forms of oppression (e.g., caste discrimination, untouchability, two-tumbler system, manual scavenging, honour killing, dishonour killing, caste murders, endogamy, intercaste marriages), and (d) discourse about legal provisions and affirmative action (e.g., reservations, quota, SC/ST act).

Figure 2 shows the percentage of caste-related stories, normalized weekly, over two years from January 2019 to January 2021. The peaks correspond to important caste-related stories that created a national-level debate. For example, the initial peaks in the first few weeks of January 2019 are related to the EWS quota bill⁵ (primarily for the so-called upper castes⁶).

We aggregated approximately 180,400 news stories using this approach. The aggregated news stories are not restricted to discriminative practices or atrocities but also cover general topics related to caste (e.g., the constitutional amendment pertaining to the quota bill). Next, we use scrapy⁷ to crawl and ingest article content from the URLs of these aggregated news stories⁸. Because the search is keyword-based, some stories are picked up merely because the term ‘caste’ occurs in them, even though they are not centered on caste. Hence, this collection is noisy.

[Figure 2: Percentage of caste-related news stories, normalized weekly, from January 2019 to January 2021. Source: Media Cloud.]

¹ Short for CastE-related NarrativeS cORpus
² http://mediacloud.org
³ https://pushshift.io/
• Caste-related Story Classifier: Since we are interested in modeling social media narratives related to caste, the noisy data might draw in irrelevant discussions that fall outside our research goal. Therefore, we clean the data by training a simple binary classifier to identify caste-related news stories. As this is a pre-processing step, one of the authors annotated around 1,000 news articles for caste specificity, i.e., tagged each as caste-centric or not. Following [DCLT18], we fine-tune a BERT-based model and use the representation of the [CLS] token to predict the label probabilities. The classifier achieves an F1 score of 89.6%, and we apply it to the noisy collection. This processing results in a total of approximately 138,800 news stories.

• Social Media Narrative Collection: To collect social media narratives around specific news stories, we use the PushShift API⁹ to search for caste-relevant keywords similar to those used earlier. We keep only those Reddit posts that contain references to URLs from our aggregated news stories dataset. Note that several

⁴ https://www.newindianexpress.com/cities/delhi/2020/jul/07/atrocities-against-dalits-see-a-rise-2166477.html
⁵ Sample news story: https://economictimes.indiatimes.com/news/politics-and-nation/opposition-questions-timing-of-quotabill/articleshow/67457766.cms
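The boolean query over the four keyword groups described for the Media Cloud search could be assembled along the following lines. This is a minimal sketch rather than the authors' actual query: the group names, the `build_query` helper, and the OR-joined, quoted-phrase syntax are assumptions made for illustration, while the keywords themselves are the examples listed in the text.

```python
# Hypothetical keyword groups; the terms are the examples given in the text,
# the dictionary keys are illustrative names only.
KEYWORD_GROUPS = {
    "castes": ["castes", "Brahmins", "Kshatriyas", "Vaishyas", "Shudras",
               "Dalits", "upper castes", "lower castes"],
    "recognized_groups": ["Scheduled Castes", "Scheduled Tribes", "SC/ST",
                          "Other Backward Classes", "OBCs"],
    "practices": ["caste discrimination", "untouchability",
                  "two-tumbler system", "manual scavenging", "honour killing",
                  "dishonour killing", "caste murders", "endogamy",
                  "intercaste marriages"],
    "legal": ["reservations", "quota", "SC/ST act"],
}


def quote(term):
    """Wrap terms containing a space or a slash in double quotes."""
    return f'"{term}"' if (" " in term or "/" in term) else term


def build_query(groups):
    """Join all unique keywords with OR (assumed boolean syntax)."""
    seen = set()
    terms = []
    for keywords in groups.values():
        for keyword in keywords:
            key = keyword.lower()
            if key not in seen:  # skip case-insensitive duplicates
                seen.add(key)
                terms.append(quote(keyword))
    return " OR ".join(terms)


query = build_query(KEYWORD_GROUPS)
print(query)
```

Joining every term with OR favors recall over precision, which is consistent with the text's description of this collection as noisy and in need of the classifier that follows.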

⁶ Usage of terms like “upper” or “lower” in relation to caste is meant to describe the existing hierarchy and how the discourse around caste is commonly understood. Such usage is not intended to reinforce that hierarchy.

⁷ https://scrapy.org/
⁸ We will not make the crawled content public. However, the URLs and other processed data, such as topics, will be made available to future researchers after we wrap up this work.
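Keeping only the Reddit posts that reference URLs from the aggregated news dataset amounts to a URL-matching step, which might look roughly like the sketch below. Everything here is an assumption for illustration: the `normalize_url` and `map_posts_to_stories` helpers, the normalization rules, and the post dictionaries with `id` and `url` keys are hypothetical and do not reproduce the authors' code.

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a comparable key (assumed rules): lowercase host
    without a leading 'www.', plus the path without a trailing slash.
    Scheme, query string, and fragment are ignored."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def map_posts_to_stories(posts, story_urls):
    """Return {story_url: [post ids]} for posts linking to a known story."""
    index = {normalize_url(url): url for url in story_urls}
    mapping = {}
    for post in posts:
        story = index.get(normalize_url(post["url"]))
        if story is not None:
            mapping.setdefault(story, []).append(post["id"])
    return mapping


stories = [
    "https://www.newindianexpress.com/cities/delhi/2020/jul/07/"
    "atrocities-against-dalits-see-a-rise-2166477.html",
]
posts = [  # hypothetical post records
    {"id": "a1", "url": "http://newindianexpress.com/cities/delhi/2020/jul/07/"
                        "atrocities-against-dalits-see-a-rise-2166477.html?utm_source=x"},
    {"id": "b2", "url": "https://example.com/unrelated"},
]
print(map_posts_to_stories(posts, stories))
```

Dropping the query string merges tracking variants of the same link, but it would also conflate distinct articles on sites that encode the article identifier in the query, so the normalization rules would need checking per source.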

To identify different high-level topics associated with caste discourse, we conduct cluster analysis using agglomerative hierarchical clustering over position-weighted Universal Sentence Encoder [CYK+18] representations. Figure 4 (b) displays the t-SNE visualization of sample stories, with manually assigned cluster labels that reflect the most common words in the stories of each cluster. For the news stories in each cluster, we extract different expressions of value judgments from their associated Reddit comments. We aggregate, cluster, and classify them as either

⁹ https://pushshift.io/
¹⁰ https://praw.readthedocs.io/en/latest/

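[Figure 4: (a) Positive, Negative, and Unclear proportions (axis 0 to 100) for Reservations, Manual Scavenging, Caste Killings, Inter-caste Marriages, and Discrimination. (b) t-SNE visualization of sample stories with cluster labels: Miscellaneous, Discrimination, Manual Scavenging, Inter-Caste Marriage, Caste Killing, and Reservations.]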
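The agglomerative hierarchical clustering used to group stories into topics can be illustrated with a small pure-Python sketch. It assumes the story embeddings are already computed (the Universal Sentence Encoder and the position weighting are not reproduced here), and the average linkage, cosine distance, and stopping threshold are illustrative choices that the text does not specify.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per vector and repeatedly merges the closest
    pair until the smallest inter-cluster distance exceeds `threshold`.
    Returns clusters as lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy 2-d "embeddings": two stories near each axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative(toy, threshold=0.2))
```

This naive version recomputes all pairwise linkages on every merge, which is fine for a toy example but far too slow for a collection of over a hundred thousand stories; a library implementation would be the practical choice at that scale.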

As a part of our ongoing modeling process (see 4), we investigate ways to identify or generate value judgments by taking news stories and user comments as input. While any language model can be applied to generate value judgments, in this work we use the transformer language model architecture introduced in [RWC+19] (GPT), which stacks multiple transformer blocks of multi-headed scaled dot-product attention and fully connected layers to encode input text [VSP+17].

Next, we will experiment with different controllable generation methods for generating counter-narratives. We will explore both data-based and decoding-based methods for this purpose. The former pre-trains a language model on the collected dataset, while the latter modifies the generation strategy without changing the model parameters. We employ special tokens by prepending them, similar to [MSRC20, FG17], or condition on the generated value judgments by concatenating them with special delimiter [SEP] tokens in between. Our counter-narrative generation can also gain from a domain-adaptive pretraining strategy (DAPT) [GMS+20] in the data-based setting. For the decoding-based method, we will leverage the plug-and-play language modeling work by Dathathri et al. [DML+19], which operates on a pre-trained language model by modifying the current and old hidden states to provide the necessary control attribute. Though this process is computation-intensive, it can be more promising than the attribute-conditioned method explained earlier. Our evaluation will use both automatic and manual metrics to assess the relevance of the counter-narratives and the quality of the generations.
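The two input-conditioning options mentioned above, prepending a special token and concatenating with [SEP] delimiters, can be sketched as plain string construction. This is only an illustration under stated assumptions: the `<counter>` control token, the helper names, and the ordering of news story, comment, and value judgment are hypothetical; the text specifies only that special tokens are prepended and that [SEP] separates the concatenated parts.

```python
SEP = "[SEP]"  # delimiter named in the text


def attribute_conditioned_input(text, control_token):
    """Prepend a control token (hypothetical, e.g. '<counter>') to the text."""
    return f"{control_token} {text}"


def judgment_conditioned_input(news, comment, value_judgment):
    """Concatenate the parts with [SEP] in between (assumed ordering)."""
    return f" {SEP} ".join([news, comment, value_judgment])


example = judgment_conditioned_input(
    "news story text", "user comment text", "inferred value judgment"
)
print(attribute_conditioned_input("user comment text", "<counter>"))
print(example)
```

In practice any new control token would also have to be registered with the tokenizer of the language model being trained, a step that depends on the chosen library and is left out here.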

Conclusion

In this work, we present a method to automatically aggregate a corpus of caste-related narratives across platforms, namely mainstream news sources and social media. Further, we intend to model these social media narratives by leveraging expressions of value judgments to generate counter-narratives for caste-related user-generated posts. We plan to evaluate the generated counter-narratives using both manual and automatic metrics. We believe that our work on modeling discourse around casteism is one of the early efforts to tackle such a critical issue and will springboard further research in this direction.

¹¹ https://github.com/cjhutto/vaderSentiment

References

[BFL98] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, 1998.

[CYK+18] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.

[DCLT18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[DML+19] Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.

[FBLM19] António Filipe Fonseca, Sohhom Bandyopadhyay, Jorge Louçã, and Jaison Manjaly. Caste in the news: A computational analysis of Indian newspapers. Social Media + Society, 5(4):2056305119896057, 2019.

[FG17] Jessica Ficler and Yoav Goldberg. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104, Copenhagen, Denmark, September 2017. Association for Computational Linguistics.

[GMS+20] Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah Smith. Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964, 2020.

[HG14] Clayton J. Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eytan Adar, Paul Resnick, Munmun De Choudhury, Bernie Hogan, and Alice H. Oh, editors, ICWSM. The AAAI Press, 2014.

[KJ18] Satyajit Kamble and Aditya Joshi. Hate speech detection from code-mixed Hindi-English tweets using deep learning models. arXiv preprint arXiv:1811.05145, 2018.

[MSRC20] Xinyao Ma, Maarten Sap, Hannah Rashkin, and Yejin Choi. PowerTransformer: Unsupervised controllable revision for biased language correction. arXiv preprint arXiv:2010.13816, 2020.

[RMJ19] Ashwin Rajadesingan, Ramaswami Mahalingam, and David Jurgens. Smart, responsible, and upper caste only: Measuring caste attitudes through large-scale analysis of matrimonial profiles. In Proceedings of the International AAAI Conference on Web and Social Media, volume 13, pages 393–404, 2019.

[RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[VR21] Prashanth Vijayaraghavan and Deb Roy. Modeling human motives and emotions from personal narratives using external knowledge and entity tracking. In Proceedings of The Web Conference 2021, 2021.

[WFE+16] Holly O. Witteman, Angela Fagerlin, Nicole Exe, Marie-Eve Trottier, and Brian J. Zikmund-Fisher. One-sided social media comments influenced opinions and intentions about home birth: An experimental study. Health Affairs, 35(4):726–733, 2016.