<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling Social Media Narratives about Caste-related News Stories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prashanth Vijayaraghavan</string-name>
          <email>pralav@media.mit.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lavanya Vijayaraghavan</string-name>
          <email>11ya.vijayaraghavan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fremont</institution>
          ,
          <addr-line>CA 08544</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MIT Media Lab</institution>
          ,
          <addr-line>Cambridge MA, 02139</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Caste as a system of social stratification has created a structured culture of oppression in the Indian subcontinent. The rise of social media has paved the way for the democratization of voices providing space for all groups to express, discuss, debate, and form opinions on critical issues. Unfortunately, it also offers a platform for closet or overtly casteist persons to perpetuate discrimination, spread hatred, and sustain casteism under the veil of creating good social media narratives. The overall goal of our work is to model social media narratives associated with caste-specific news stories. To this end, we first aggregate user-generated social media content (e.g., comments) about various caste-related news stories. Next, we analyze these aggregated contents to extract divergent value judgments representing different opinions associated with these news stories. Finally, our ongoing research will provide means to infer the value judgments for each user-generated content automatically, track divergent narratives related to the particular news story, and tackle casteist social media posts using counter-narratives generated by leveraging on the inferred value judgments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Copyright © by the paper’s authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>[Figure: data processing pipeline for the CENSOR corpus. Stages: Noisy Caste-related Text Collection (keyword-based filtering); Caste-related Story Classifier (BERT-based classifier); Social Media Narrative Collection from Reddit (deduplication, news-comments mapping); Value Judgment Extraction (FrameNet lexical units).]</p>
      <p>Prior research has primarily focused on studies of endogamy and the associated qualitative interviews and surveys [RMJ19], hate speech detection on social media [KJ18], and analysis of caste in newsrooms [FBLM19], to list a few. However, there is a vast gap between the prevalence of casteism in day-to-day life and the NLP efforts devoted to understanding and tackling it on social media. The increasing crimes against the disadvantaged caste sections of society are generally reflected in news reports, despite the bias in newsrooms [FBLM19]. Moreover, several studies have indicated tremendous growth in online news consumption, especially through social media. However, [WFE+16] observed that exposure to one-sided social media comments influenced participants’ opinions on issues, mainly when the comments contained personal stories. Therefore, we deem it necessary to identify the divergent narratives emerging from caste-related news stories and to tackle casteist discourse with effective counter-narratives.</p>
      <p>In our work, we construct a corpus, Censor<sup>1</sup>, using a data processing pipeline that (a) aggregates caste-related news stories from mainstream news sources, (b) maps these news stories to social media comments (Reddit, in our case), and (c) extracts the value judgments representing different user opinions on each news story. Finally, we will apply different learning strategies to infer value judgments and effectively generate counter-narratives to user-generated comments. Our contribution is three-fold:
• An automated data collection pipeline for aggregating caste-related stories.
• A cross-platform corpus, Censor, that contains characteristic narratives aggregated by extracting value judgments from Reddit content about caste-related stories.
• An ongoing modeling approach to infer value judgments and generate counter-narratives for user-generated comments.</p>
    </sec>
    <sec id="sec-2">
      <title>Dataset Collection</title>
      <p>• Noisy Caste-related Text Collection: From traditional news media, we specifically look up English-language digital news sources and run a keyword-based search for caste. Keywords include broad caste (or varna) names and officially recognized names used by the media, as explained in Section 1. In this work, we focus on English-language news stories from media collections in India (at the national, state, and local levels, as indicated in Media Cloud). Since caste atrocities increased during the lockdown period related to the COVID-19 pandemic<sup>4</sup>, we collect news data over a two-year window, i.e., between 2019 and 2021, to analyze this pattern and characterize the nature of social media narratives during the same period. Given that Media Cloud permits boolean searches, we construct a general boolean query containing keywords related to: (a) general mentions of castes or their categories (e.g., castes, Brahmins, Kshatriyas, Vaishyas, Shudras, Dalits, upper castes, lower castes), (b) government-recognized group names and their abbreviations (e.g., Scheduled Castes, Scheduled Tribes, SC/ST, Other Backward Classes, OBCs), (c) common references to discriminative practices and forms of oppression (e.g., caste discrimination, untouchability, two-tumbler system, manual scavenging, honour killing, dishonour killing, caste murders, endogamy, intercaste marriages), and (d) discourse about legal provisions and affirmative action (e.g., reservations, quota, SC/ST act). Figure 2 shows the percentage of caste-related stories normalized weekly over two years from January 2019 to January 2021. We note that the peaks correspond to important caste-related stories that created a national-level debate. For example, the initial peaks in the first few weeks of Jan’19 are related to the EWS quota bill<sup>5</sup> (primarily for the so-called upper castes<sup>6</sup>). We aggregated ~180,400 news stories using this approach. The aggregated news stories are not restricted to discriminative practices or atrocities but also cover general topics related to castes (e.g., the constitutional amendment pertaining to the quota bill). Next, we use scrapy<sup>7</sup> to crawl and ingest article content from the URLs of these aggregated news stories<sup>8</sup>. Since we used a keyword-based search, this collection could contain stories that are not centered on caste but were picked up due to the mere occurrence of the term ‘caste’.</p>
      <p><sup>1</sup> Short for CastE-related NarrativeS cORpus
<sup>2</sup> http://mediacloud.org
<sup>3</sup> https://pushshift.io/</p>
      <p>[Figure: Caste-Related News (Source: MediaCloud)]</p>
      <p>
• Caste-related Story Classifier: Since we are interested in modeling social media narratives related to caste, the noisy data might draw in irrelevant discussions that fall outside our research goal. Therefore, we clean up our data by training a simple binary classifier to identify caste-related news stories. As this is a pre-processing step, one of the authors annotated around 1,000 news articles for caste specificity, i.e., tagged each as caste-centric or not. Following [DCLT18], we fine-tune a BERT-based model and use the representation of the [CLS] token to predict the label probabilities. The classifier achieves an F1 score of 89.6%, and we apply it to the noisy collection. This processing results in a total of ~138,800 news stories.
• Social Media Narrative Collection: To collect social media narratives around specific news stories, we use the PushShift API<sup>9</sup> to search for the same caste-relevant keywords as before. We keep only those Reddit posts that contain references to URLs from our aggregated news stories dataset. Note that several</p>
      <p><sup>4</sup> https://www.newindianexpress.com/cities/delhi/2020/jul/07/atrocities-against-dalits-see-a-rise-2166477.html
<sup>5</sup> Sample news story: https://economictimes.indiatimes.com/news/politics-and-nation/opposition-questions-timing-of-quotabill/articleshow/67457766.cms</p>
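      <p>The boolean query described in the first step can be sketched as follows. This is a minimal illustration, not the query we actually issued: the keyword lists copy only the examples given in the text, the helper names (quote_term, build_query) are hypothetical, and the assumption that the search backend accepts OR operators and double-quoted phrases should be checked against the Media Cloud documentation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
"""Sketch: assemble one boolean OR query from the four keyword groups."""

# Keyword groups (a) to (d); only the examples listed in the text are included.
KEYWORD_GROUPS = {
    "castes": ["castes", "Brahmins", "Kshatriyas", "Vaishyas", "Shudras",
               "Dalits", "upper castes", "lower castes"],
    "official_names": ["Scheduled Castes", "Scheduled Tribes", "SC/ST",
                       "Other Backward Classes", "OBCs"],
    "practices": ["caste discrimination", "untouchability",
                  "two-tumbler system", "manual scavenging",
                  "honour killing", "dishonour killing", "caste murders",
                  "endogamy", "intercaste marriages"],
    "legal": ["reservations", "quota", "SC/ST act"],
}


def quote_term(term):
    """Quote multi-word or punctuated terms so they match as phrases."""
    if term.isalnum():
        return term
    return '"' + term + '"'


def build_query(groups):
    """OR the terms within each group, then OR the groups together."""
    clauses = []
    for terms in groups.values():
        clauses.append("(" + " OR ".join(quote_term(t) for t in terms) + ")")
    return " OR ".join(clauses)


# Assumed query syntax: OR operators and quoted phrases.
query = build_query(KEYWORD_GROUPS)
```
]]></preformat>
      <p>Keeping the groups separate makes it easy to add or drop a category of keywords and to inspect which group is responsible for off-topic matches.</p>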
      <p><sup>6</sup> Usage of terms like “upper” or “lower” relating to caste is meant to describe the existing hierarchy and how the discourse around caste is commonly understood. Such usages are not intended to reinforce that hierarchy.</p>
      <p><sup>7</sup> https://scrapy.org/
<sup>8</sup> We will not make the crawled content data public. However, the URLs and other processed data like topics will be available for future researchers after we wrap up this work.</p>
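      <p>The mapping between Reddit posts and news stories relies on shared URLs. The sketch below shows one way such a match could be made; it is an assumption rather than our exact procedure. The normalization rule (lowercased host without "www.", path without trailing slash, query string dropped), the function names, the post fields "id" and "url", and the example.com addresses are all hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to host + path so tracking parameters do not block a match."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return host + path


def map_posts_to_stories(story_urls, posts):
    """Return {story_url: [post ids]} for posts that link to a known story.

    `posts` is an iterable of dicts with hypothetical keys "id" and "url".
    """
    index = {normalize_url(u): u for u in story_urls}
    mapping = {}
    for post in posts:
        story = index.get(normalize_url(post["url"]))
        if story is not None:
            mapping.setdefault(story, []).append(post["id"])
    return mapping


# Made-up example data for illustration only.
stories = ["https://www.example.com/news/quota-bill/"]
posts = [
    {"id": "p1", "url": "http://example.com/news/quota-bill?utm_source=reddit"},
    {"id": "p2", "url": "https://example.com/sports/cricket"},
]
linked = map_posts_to_stories(stories, posts)
```
]]></preformat>
      <p>Dropping the query string would wrongly merge articles on sites that identify stories by a query parameter, so a production version would need per-site rules.</p>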
      <p><sup>9</sup> https://pushshift.io/</p>
      <p>To identify different high-level topics associated with caste discourse, we conduct cluster analysis using agglomerative hierarchical clustering over position-weighted Universal Sentence Encoder [CYK+18] representations. Figure 4 (b) displays the t-SNE visualization of sample stories, with manually assigned cluster labels that reflect the most common words in the stories of each cluster. For the news stories in each cluster, we extract different expressions of value judgments from their associated Reddit comments. We aggregate, cluster, and classify them as either</p>
      <p><sup>10</sup> https://praw.readthedocs.io/en/latest/</p>
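      <p>As an illustration of the agglomerative clustering step, the sketch below merges the two closest clusters until a target number remains. The linkage criterion (average) and the distance (cosine) are assumptions, since the text does not fix them, and the two-dimensional toy vectors stand in for the sentence-encoder representations of news stories.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors, n_clusters):
    """Start from singletons and merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy 2-d "embeddings"; real inputs would be sentence-encoder vectors.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = agglomerative(toy, n_clusters=2)
```
]]></preformat>
      <p>This naive version recomputes all pairwise linkages at every merge, which is fine for a demonstration but too slow for a corpus of this size; a library implementation would be the practical choice.</p>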
      <p>[Figure 4: panel (a) covers the topics Reservations, Manual Scavenging, Caste Killings, Inter-caste Marriages, and Discrimination, with legend entries Positive, Negative, and Unclear and an axis from 0 to 100; panel (b) is the t-SNE visualization with cluster labels Miscellaneous, Discrimination, Manual Scavenging, Inter-Caste Marriage, Caste Killing, and Reservation.]</p>
      <sec id="sec-2-7">
        <p>As a part of our ongoing modeling process (see 4), we investigate ways to identify or generate value judgments by taking news stories and user comments as input. While any language model can be applied to generate value judgments, in this work, we use the transformer language model architecture introduced in [RWC+19] (GPT), which stacks multiple transformer blocks of multi-headed scaled dot-product attention and fully connected layers to encode input text [VSP+17]. Next, we will experiment with different controllable generation methods for generating counter-narratives. We will explore both data-based and decoding-based methods for this purpose. The former pre-trains a language model on the collected dataset, while the latter modifies the generation strategy without changing the model parameters. We employ special tokens by prepending them, similar to [MSRC20, FG17], or condition on the generated value judgments by concatenation with special delimiter [SEP] tokens in between. Our counter-narrative generation can also gain from a domain-adaptive pretraining strategy (DAPT) [GMS+20] in the data-based setting. For the decoding-based method, we will leverage the plug-and-play language modeling work by Dathathri et al. [DML+19], which operates on a pre-trained language model by modifying the current and past hidden states to provide the necessary attribute control. Though this process is computation-intensive, it can be more promising than the attribute-conditioned method explained earlier. Our evaluation will use both automatic and manual metrics to assess the relevance of counter-narratives and the quality of generations.</p>
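        <p>The conditioning by concatenation described above can be sketched as a simple input-formatting step. The segment order (news story, comment, value judgment), the function name, the control token "[COUNTER]", and the example sentences are all hypothetical; only the use of a prepended special token and [SEP] delimiters comes from the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
SEP = "[SEP]"


def build_conditioned_input(news, comment, value_judgment, control_token=None):
    """Join the inputs with [SEP] delimiters, optionally prepending a control token."""
    segments = [news.strip(), comment.strip(), value_judgment.strip()]
    text = f" {SEP} ".join(segments)
    if control_token:
        text = f"{control_token} {text}"
    return text


# Made-up example texts; "[COUNTER]" is a hypothetical control token.
example = build_conditioned_input(
    news="Parliament passes the quota bill.",
    comment="This will change college admissions.",
    value_judgment="reservations are necessary",
    control_token="[COUNTER]",
)
```
]]></preformat>
        <p>In practice the control and delimiter tokens would also have to be registered with the tokenizer of the language model so that each maps to a single vocabulary entry.</p>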
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>In this work, we present a method to automatically aggregate a caste-related narrative corpus across platforms, namely mainstream news sources and social media. Further, we intend to model the social media narratives by leveraging expressions of value judgments to generate counter-narratives for caste-related user-generated posts on social media. We plan to evaluate the generated counter-narratives with both manual and automatic metrics. We believe that our work on modeling discourse around casteism is one of the early efforts to tackle such a critical issue and will springboard further research in this direction.</p>
      <p><sup>11</sup> https://github.com/cjhutto/vaderSentiment</p>
      <p>[BFL98] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pages 86–90, 1998.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
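        <mixed-citation>
          [CYK+18]
          <string-name><given-names>Daniel</given-names> <surname>Cer</surname></string-name>,
          <string-name><given-names>Yinfei</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Sheng-yi</given-names> <surname>Kong</surname></string-name>,
          <string-name><given-names>Nan</given-names> <surname>Hua</surname></string-name>,
          <string-name><given-names>Nicole</given-names> <surname>Limtiaco</surname></string-name>,
          <string-name><given-names>Rhomni</given-names> <surname>St John</surname></string-name>,
          <string-name><given-names>Noah</given-names> <surname>Constant</surname></string-name>,
          <string-name><given-names>Mario</given-names> <surname>Guajardo-Céspedes</surname></string-name>,
          <string-name><given-names>Steve</given-names> <surname>Yuan</surname></string-name>,
          <string-name><given-names>Chris</given-names> <surname>Tar</surname></string-name>, et al.
          <article-title>Universal sentence encoder</article-title>.
          <source>arXiv preprint arXiv:1803.11175</source>,
          <year>2018</year>.
        </mixed-citation>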
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [DCLT18]
          <string-name><given-names>Jacob</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>Ming-Wei</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>Kenton</given-names> <surname>Lee</surname></string-name>, and
          <string-name><given-names>Kristina</given-names> <surname>Toutanova</surname></string-name>.
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>.
          <source>arXiv preprint arXiv:1810.04805</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
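        <mixed-citation>
          [DML+19]
          <string-name><given-names>Sumanth</given-names> <surname>Dathathri</surname></string-name>,
          <string-name><given-names>Andrea</given-names> <surname>Madotto</surname></string-name>,
          <string-name><given-names>Janice</given-names> <surname>Lan</surname></string-name>,
          <string-name><given-names>Jane</given-names> <surname>Hung</surname></string-name>,
          <string-name><given-names>Eric</given-names> <surname>Frank</surname></string-name>,
          <string-name><given-names>Piero</given-names> <surname>Molino</surname></string-name>,
          <string-name><given-names>Jason</given-names> <surname>Yosinski</surname></string-name>, and
          <string-name><given-names>Rosanne</given-names> <surname>Liu</surname></string-name>.
          <article-title>Plug and play language models: A simple approach to controlled text generation</article-title>.
          <source>arXiv preprint arXiv:1912.02164</source>,
          <year>2019</year>.
        </mixed-citation>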
      </ref>
      <ref id="ref4">
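        <mixed-citation>
          [FBLM19]
          <string-name><given-names>António Filipe</given-names> <surname>Fonseca</surname></string-name>,
          <string-name><given-names>Sohhom</given-names> <surname>Bandyopadhyay</surname></string-name>,
          <string-name><given-names>Jorge</given-names> <surname>Louçã</surname></string-name>, and
          <string-name><given-names>Jaison</given-names> <surname>Manjaly</surname></string-name>.
          <article-title>Caste in the news: A computational analysis of Indian newspapers</article-title>.
          <source>Social Media + Society</source>,
          <volume>5</volume>(<issue>4</issue>):<fpage>2056305119896057</fpage>,
          <year>2019</year>.
        </mixed-citation>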
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [FG17]
          <string-name><given-names>Jessica</given-names> <surname>Ficler</surname></string-name> and
          <string-name><given-names>Yoav</given-names> <surname>Goldberg</surname></string-name>.
          <article-title>Controlling linguistic style aspects in neural language generation</article-title>.
          In <source>Proceedings of the Workshop on Stylistic Variation</source>,
          pages <fpage>94</fpage>–<lpage>104</lpage>, Copenhagen, Denmark, September
          <year>2017</year>.
          <publisher-name>Association for Computational Linguistics</publisher-name>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [GMS+20]
          <string-name><given-names>Suchin</given-names> <surname>Gururangan</surname></string-name>,
          <string-name><given-names>Ana</given-names> <surname>Marasović</surname></string-name>,
          <string-name><given-names>Swabha</given-names> <surname>Swayamdipta</surname></string-name>,
          <string-name><given-names>Kyle</given-names> <surname>Lo</surname></string-name>,
          <string-name><given-names>Iz</given-names> <surname>Beltagy</surname></string-name>,
          <string-name><given-names>Doug</given-names> <surname>Downey</surname></string-name>, and
          <string-name><given-names>Noah A.</given-names> <surname>Smith</surname></string-name>.
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>.
          <source>arXiv preprint arXiv:2004.10964</source>,
          <year>2020</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [HG14]
          <string-name><given-names>Clayton J.</given-names> <surname>Hutto</surname></string-name> and
          <string-name><given-names>Eric</given-names> <surname>Gilbert</surname></string-name>.
          <article-title>VADER: A parsimonious rule-based model for sentiment analysis of social media text</article-title>.
          In Eytan Adar, Paul Resnick, Munmun De Choudhury, Bernie Hogan, and Alice H. Oh, editors,
          <source>ICWSM</source>.
          <publisher-name>The AAAI Press</publisher-name>,
          <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [KJ18]
          <string-name><given-names>Satyajit</given-names> <surname>Kamble</surname></string-name> and
          <string-name><given-names>Aditya</given-names> <surname>Joshi</surname></string-name>.
          <article-title>Hate speech detection from code-mixed Hindi-English tweets using deep learning models</article-title>.
          <source>arXiv preprint arXiv:1811.05145</source>,
          <year>2018</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
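        <mixed-citation>
          [MSRC20]
          <string-name><given-names>Xinyao</given-names> <surname>Ma</surname></string-name>,
          <string-name><given-names>Maarten</given-names> <surname>Sap</surname></string-name>,
          <string-name><given-names>Hannah</given-names> <surname>Rashkin</surname></string-name>, and
          <string-name><given-names>Yejin</given-names> <surname>Choi</surname></string-name>.
          <article-title>PowerTransformer: Unsupervised controllable revision for biased language correction</article-title>.
          <source>arXiv preprint arXiv:2010.13816</source>,
          <year>2020</year>.
        </mixed-citation>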
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [RMJ19]
          <string-name><given-names>Ashwin</given-names> <surname>Rajadesingan</surname></string-name>,
          <string-name><given-names>Ramaswami</given-names> <surname>Mahalingam</surname></string-name>, and
          <string-name><given-names>David</given-names> <surname>Jurgens</surname></string-name>.
          <article-title>Smart, responsible, and upper caste only: Measuring caste attitudes through large-scale analysis of matrimonial profiles</article-title>.
          In <source>Proceedings of the International AAAI Conference on Web and Social Media</source>,
          volume <volume>13</volume>,
          pages <fpage>393</fpage>–<lpage>404</lpage>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [RWC+19]
          <string-name><given-names>Alec</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>Jeffrey</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>Rewon</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>David</given-names> <surname>Luan</surname></string-name>,
          <string-name><given-names>Dario</given-names> <surname>Amodei</surname></string-name>, and
          <string-name><given-names>Ilya</given-names> <surname>Sutskever</surname></string-name>.
          <article-title>Language models are unsupervised multitask learners</article-title>.
          <source>OpenAI blog</source>,
          <volume>1</volume>(<issue>8</issue>):<fpage>9</fpage>,
          <year>2019</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [VR21]
          <string-name>
            <given-names>Prashanth</given-names>
            <surname>Vijayaraghavan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Deb</given-names>
            <surname>Roy</surname>
          </string-name>
          .
          <article-title>Modeling human motives and emotions from personal narratives using external knowledge and entity tracking</article-title>
          .
          In <source>Proceedings of The Web Conference 2021</source>,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [WFE+16]
          <string-name><given-names>Holly O.</given-names> <surname>Witteman</surname></string-name>,
          <string-name><given-names>Angela</given-names> <surname>Fagerlin</surname></string-name>,
          <string-name><given-names>Nicole</given-names> <surname>Exe</surname></string-name>,
          <string-name><given-names>Marie-Eve</given-names> <surname>Trottier</surname></string-name>, and
          <string-name><given-names>Brian J.</given-names> <surname>Zikmund-Fisher</surname></string-name>.
          <article-title>One-sided social media comments influenced opinions and intentions about home birth: An experimental study</article-title>.
          <source>Health Affairs</source>,
          <volume>35</volume>(<issue>4</issue>):<fpage>726</fpage>–<lpage>733</lpage>,
          <year>2016</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>