<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Building a corpus on Eating Disorders from TikTok: challenges and opportunities</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Melissa Donati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ludovica Polidori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paola Vernillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gloria Gagliardi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present two synchronic corpora of Eating Disorders (ED) related discourse on Social Media. The PAC (ProAna/Anorexia Corpus) and RAC (Recovery from Ana/Anorexia Corpus) resources focus on content posted on TikTok by, respectively, communities promoting anorectic behavior and users sharing their experiences of recovering from an ED. We report on the corpus statistics and creation process, focusing specifically on the methodological issues raised by this novel Social Media platform.</p>
      </abstract>
      <kwd-group>
        <kwd>Eating Disorders</kwd>
        <kwd>Corpus Linguistics</kwd>
        <kwd>TikTok</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>It was only 20 years ago that one of the darkest sides of Eating Disorders (ED) was revealed through the proliferation of websites, blogs, and social networks, in which a growing number of adolescents and young adults started sharing information about their eating experiences with like-minded users. Among these pro-ED communities, researchers and clinicians showed particular concern for pro-Ana (i.e., “pro-anorexia”) groups, i.e., web-based communities of anorexic (or aspiring anorexic) individuals engaged in the promotion of their Eating Disorder [1]. Interestingly, one of the most horrific and dangerous aspects of pro-Ana groups is that Anorexia Nervosa (AN) is presented not as a psychiatric disorder associated with pathological body image dissatisfaction [2], but as a way of living with its own rules and rituals to be respected. While much has been done over the last years to prevent the circulation of pro-ED content on social media (e.g., TikTok’s adoption of measures to obscure harmful content [3]), a new, mirror-image phenomenon has recently taken hold: the spread of pro-recovery accounts run by individuals who are healing from an ED and are willing to share their eating experience to help other online users [4].</p>
      <p>From a linguistic perspective, research on ED has been very limited and has become an object of study only in recent years [5, 6, 7, 8, 9, 10], as opposed to other psychopathologies such as schizophrenia [11, 12], personality disorder [13], and depression [<xref ref-type="bibr" rid="ref5">14, 15, 16, 17</xref>]. This already problematic picture is further compromised by the inhomogeneous representation of linguistic data in the literature, where the majority of studies have been dedicated to the linguistic profiling of ED-affected individuals in a Germanic language (English, German, Norwegian) [18]. This paper represents a small step towards reversing this tendency, and a crucial part of two larger projects (Metaphan<sup>1</sup> and RaAM project 2022<sup>2</sup>) that aim to identify, through different NLP techniques and tools, potential lexical and semantic patterns in the language of anorectic individuals. To this end, we describe the process of collecting data (i.e., oral and written productions) from ED communities on TikTok, currently the most widely used social media platform among young people and adolescents, namely the population groups at greater risk for EDs. In the following, we give a brief overview of the literature on the topic (Section 2), then describe the process of creating the corpus and discuss the methodological issues encountered (Section 3), and conclude with a few insights for future work (Section 4).</p>
      <p>2. Related Works</p>
      <p>In recent years, we have witnessed exponential growth in the use of Social Media (SM), especially by adolescents and young people. The community-building nature and the interactive dynamics of these platforms, as well as the less direct way of communicating, encourage users to openly discuss a wide variety of topics [19]. In turn, this makes available a huge amount of data that can be used for different purposes (e.g., extracting actionable patterns, forming conclusions about users, conducting research, etc.).</p>
      <p>For this reason, Social Media Mining (SMM), i.e., the process of extracting big data from SM, now constitutes a well-established methodology to collect large samples of data in different research areas [20]. This approach has proved particularly fruitful for collecting data on EDs, as people suffering from these disorders seem to overcome the self-protective nature of their ED to engage in ED-related discourse with online users sharing similar experiences [21]. Indeed, in the last decade, many studies have used different SM platforms as a source of data to analyze EDs [22, 23, 24, 25, 21, 26, 27, 28, 29]. However, the state of the art on ED discourse on SM currently presents two main limitations: i) the majority of the analyses were carried out on small datasets built ad hoc for the purpose of the work (with the only exception of [30]), and ii) they mostly focused on the English language. As a matter of fact, in the Italian context there has been very little research on the representation of EDs on SM, and that little was mostly focused on Anorexia Nervosa and did not target EDs in general [31, 32, 33].</p>
      <p><sup>1</sup>https://site.unibo.it/metaphan/en</p>
      <p><sup>2</sup>https://site.unibo.it/metaphan/en/connected-research-activities</p>
      <sec id="sec-1-1">
        <title>3. Corpus Creation: Methodological Issues</title>
        <p>Against this background and intending to fill this gap, we created a collection of English and Italian ED-related data that could be used for different types of research (from purely linguistic and content analyses that could help pinpoint the features and characteristics of ED-related discourse, to various computational techniques that could be used to implement systems for the automatic detection of ED-related content on SM). We selected TikTok as a source of data as it currently represents the most widely used SM, especially among young people and adolescents, namely the at-risk population for EDs [34].</p>
        <p>To achieve this goal, we first needed to define the nature and characteristics of the corpus itself. As far as the linguistic features are concerned, our corpus is specialized (i.e., it is focused on the topic of ED discourse on TikTok), synchronic (i.e., it refers to a specific point in time, namely the moment the data were downloaded), and targets both written and spoken language (TikTok videos contain spoken and/or written text). We did not set a target size a priori, because this feature is totally dependent upon the possibility of extracting the data automatically (Section 3.1). Instead, following the common practice in the domain of SMM, we assumed that ‘there is no data like more data’ and intended to download as many videos as possible. To maximize the representativeness of the corpus, we tried to balance the sample with respect to the types of videos being collected, but we could not do so with respect to the users’ gender, because for both corpora the vast majority of profiles belonged to female individuals (see Section 3.3 for more details). The target population consisted of those profiles that identify themselves in one of the two following categories: i) supporters of anorectic behaviors (for the English PAC corpus); ii) witnesses and motivators for the recovery process (for the Italian RAC corpus). Such profiles were identified based on the linguistic and non-linguistic (i.e., emojis) information present in their profile bio. The selection criteria will be presented in Section 3.1, prior to the description of the data collection process and the discussion of the related issues that were encountered.</p>
        <p>Before going further into the methodology, an ethical consideration concerning the collection of data from SM is necessary. Broadly speaking, SM posts that are publicly accessible are treated as belonging to the public domain; therefore, according to common practice, consent from the creators is not deemed necessary to download such data. This is strengthened by the fact that, upon registration, TikTok asks its users to consent to a set of terms of service that make the data available for access to third parties [35]. In addition, when creating and managing their accounts and contents, users can decide to make them publicly accessible or private (i.e., only viewable by accepted followers); at any time, they can also restrict access to some of their contents through privacy settings and choose whether to make them downloadable. For the above reasons, given that for the purpose of this work only public and downloadable data were analysed, we did not seek users’ consent to collect the posts. In line with similar SM analyses [26], no reference to any identifying information, such as usernames, will be made.</p>
        <sec id="sec-1-1-1">
          <title>3.1. Data Collection</title>
          <p>As explained above, the selection criteria adopted to identify the target profiles were based on the information present in the profiles’ bios. However, to track down the target profiles, we needed to start from a list of ED-related hashtags that could lead us to such profiles via a keyword-based search. The hashtags used here were generated both by brainstorming and by exploring the platform for a couple of weeks, noting down the most popular trends and the most widely used hashtags (see Table 1 for an overview). Following this hashtag-driven search, we noticed that there was very little, if any, pro-Ana content produced in Italian, which is why for this type of ED-related content we decided to collect a small sample of English data. On the other hand, we found quite a few profiles representing the ED-recovery community.</p>
          <table-wrap>
            <label>Table 1</label>
            <table>
              <thead>
                <tr><th>Pro-Ana hashtags</th><th>Pro-Recovery hashtags</th></tr>
              </thead>
              <tbody>
                <tr><td>#weightloss (#w3ightl0ss)</td><td>#dcarecovery (#dcar3covery)</td></tr>
                <tr><td>#unhealthyweightloss (+ lexical variations)</td><td>#dca<sup>4</sup>, #dcaitalia</td></tr>
                <tr><td>#kpop<sup>3</sup></td><td>#fiocchettolilla<sup>5</sup>, #dcafighting</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p><sup>3</sup>K-pop (for Korean-pop) is a popular genre of music originating from South Korea that has been hugely influential in the ‘diet scene’, because young people want to look like their favourite K-pop stars, who are known for their extreme diets; indeed, many young artists have left the K-pop world in order to focus on eating disorder treatment.</p>
          <p><sup>4</sup>Disturbo del Comportamento Alimentare (Eating Disorder).</p>
          <p><sup>5</sup>The Lilac Ribbon is the official international symbol against Eating Disorders.</p>
          <p>Among these profiles, we selected those having at least 10k followers (some of them exceed 2M followers) and at least 10 ED-related posts, so as to maximize the chance of gathering interesting and relevant linguistic information. We then used the ED-related hashtags to conduct a within-profile search to select only the ED-related videos in each profile in order to extract them.</p>
          <p>At this point, the next step consisted of extracting the identified ED-related videos from the selected profiles. For the sake of time and efficiency, we wanted to download the data automatically. However, differently from other popular SM, TikTok has not yet released any official API that can be used by researchers and developers to automate the process of accessing and extracting the data. In addition, even if unofficial APIs exist, they get outdated almost immediately after their release because TikTok is constantly updating the anti-bot system preventing automatic access from the same IP. To get around this, we looked for a reliable and cost-effective proxy provider for TikTok scraping, but we could not find any viable solution.</p>
          <p>Therefore we decided to proceed with the manual downloading of the data. The main drawback of this approach is that, due to time and resource constraints, we could not collect a very large number of videos (see Table 2). On the bright side, however, manual downloading allowed us to i) enhance the content filtering process and ii) notice that TikTok videos have different formatting styles that might be worth distinguishing, not only to ease the ensuing transcription process but also to conduct separate content analyses and compare the different results. Based on our observations about the different formatting styles, we grouped the TikTok videos into 4 subcorpora: 1) Speech-only: in which the user is talking in the absence of background music and/or written text; 2) Playback: in which the user lip-syncs over a song or an extract from a movie or TV show; 3) Text-only: in which there is neither background music nor the users themselves speaking, but only written text superimposed on the video; and 4) Mixed: in which the above-mentioned features are present in various combinations.</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <p>Organizing the videos into 4 categories was particularly useful for the transcription phase, as it allowed us to adopt different strategies and techniques based on the input characteristics. As with the downloading phase, although we intended to automate the transcription process as much as possible, the high complexity of the data made human intervention necessary in some cases.</p>
        <p>For speech-only and playback videos, automatic transcription was performed using the Google Web Speech API, which is easily accessible through the SpeechRecognition library [36]. To assess the quality of the automatic transcription, a random sample of videos (n=10) for each category was extracted, transcribed manually, and then compared with the machine-based transcription. For speech-only videos, a high agreement score was obtained between human and machine transcription (&gt;90%), which confirmed the viability of the method adopted. Conversely, playback videos proved more problematic: manual correction was needed because both singing and the musical accompaniment adversely affected intelligibility.</p>
        <p>Automatic transcription was also attempted for text-only videos by means of Optical Character Recognition (OCR) using the Tesseract OCR engine [37], but we obtained poor results due to the high visual complexity of the input data, more specifically the extreme variability of font type, size, and color, the lack of adequate contrast with the background, the non-hierarchical spatial organization of texts, and the presence of non-textual graphical elements (e.g., lexical variations of words, where letters are substituted by numbers or emojis to prevent the platform’s censorship and filtering system from blocking the content as potentially harmful, e.g., ‘starving’ written with ‘star’ replaced by the corresponding emoji, or ‘disorder’ written as ‘d1s0rder’). The same issue, to an even greater degree, was observed with mixed videos, where speech, music, and written text were mingled. Therefore, for these two categories of videos, we could only perform the transcription manually.</p>
        <p>We report below, as an example of the type of ED-related content that was selected, the transcription of two videos, one for each of the two datasets.</p>
        <p>[from RAC]</p>
        <p>"questo video è davvero davvero difficile da registrare per me ma lo faccio perché voglio condividere tutta la mia vita con voi e voglio aiutare delle persone che si trovano nella mia stessa situazione parlando del mio problema dovete sapere che io sono stata prima anoressica sono arrivata a pesare 36 kg e vi parlerò poi te la causa scatenante poi riscoperto il cibo ho iniziato ad abbuffarmi in una maniera assurda a sentirmi in colpa e quindi poi a vomitare questa si chiama bulimia ovviamente alternavo momenti digiuno quindi magari non mangio proprio per giorni a momenti in cui il tuo corpo ha bisogno di cibo e quindi ti abbuffi e mangi qualsiasi cosa volevo solo dirvi che ieri è successa un’altra volta il fatto è che io me lo vedo subito in faccia cioè mi vedo 10 volte più grossa e mi sento davvero super gonfia che senti ma sono riuscita a non vomitare perché io sono più forte sono con tutte voi"<sup>6</sup></p>
        <p>[from PAC]</p>
        <p>"i’m *** i’m a new member stats starter weight 140.1 ibs goal weight 100 ibs ultimate goal weight 90 ibs for now i binge eat when i’m bored so i gained a lot of weight in the past months i’m trying to limit myself on eating i am currently 4’10 and i’m overweight for my height age i listen to subliminal and trying to workout also i hate exercising but i realized it is healthy for me and my body 33"</p>
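        <p>The comparison between manual and automatic transcripts described above can be illustrated with a minimal Python sketch. It assumes that agreement is measured at the word level as one minus the word error rate (word-level edit distance divided by the length of the manual transcript); the metric behind the reported score is not specified in this section, so this is only one plausible choice. The function names <monospace>word_edit_distance</monospace> and <monospace>agreement</monospace> and the two sample strings are hypothetical and are not taken from the corpora.</p>
        <preformat>
```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two token lists.

    Insertions, deletions, and substitutions all cost 1.
    """
    previous = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(
                previous[j] + 1,         # token missing from the hypothesis
                current[j - 1] + 1,      # extra token in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def agreement(manual, automatic):
    """Word-level agreement in [0, 1]: 1 minus the word error rate.

    ASSUMPTION: lowercasing and whitespace tokenization stand in for
    whatever normalization the transcription guidelines prescribe.
    """
    ref = manual.lower().split()
    hyp = automatic.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


if __name__ == "__main__":
    # Hypothetical transcripts, invented for illustration only.
    manual = "i want to share my story with you"
    automatic = "i want to share my story with youth"
    print(f"agreement: {agreement(manual, automatic):.3f}")
```
        </preformat>
        <p>Under this assumption, the per-video scores of the sampled videos in a category could be averaged and compared against the 90% threshold mentioned above. Punctuation and disfluency handling would need to follow the same conventions in both transcripts for the score to be meaningful.</p>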
        <sec id="sec-1-2-1">
          <title>3.3. Corpus Statistics</title>
        </sec>
      </sec>
      <sec id="sec-1-3">
        <p>Table 2 reports an overview of the statistics for the two corpora in terms of number of videos, number of words, and number of users from whose profiles the data were extracted.</p>
        <p>The two corpora are registered in CLARIN<sup>7</sup>, but are not publicly accessible for the moment.</p>
        <p><sup>6</sup>[our translation] "making this video is really really hard for me but I am doing it because I want to share everything about my life with you and I want to help those who are experiencing the same situation by talking about my problem you must know that I have suffered first from anorexia I ended up weighing 36 kg and I will tell you about the trigger then I rediscovered food and started insanely binging and feeling guilty and then as a consequence throwing up this is called bulimia obviously I alternated periods of fasting so perhaps I would not eat for days with periods in which my body needed food and I would eat anything and I just wanted to tell you that yesterday it happened again and the thing is that I see it immediately on my face that is I see myself 10 times bigger and I feel really extremely bloated that you know but I managed not to throw up because I am stronger I am with you all"</p>
        <p><sup>7</sup>http://hdl.handle.net/20.500.11752/OPEN-997</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Conclusion and Future Works</title>
      <sec id="sec-2-1">
        <p>The aim of this work was twofold: on the one hand, we wanted to present two corpora on EDs, the English pro-Ana corpus (PAC) and the Italian pro-Recovery corpus (RAC), both built by extracting data from the popular SM TikTok; on the other, we wanted to discuss some methodological issues related to building a corpus using this platform as a source of data. More specifically, we pointed out that the absence of an official API does not allow the automatic extraction of the videos and requires manual work, which is highly time-consuming and prevents the collection of a very large sample of data. This, in turn, might impede the application of more complex computational analyses and limit the generalizability of the results. In addition, we raised the issue of transcribing the videos to text. In this case, implementing automatic approaches is not always feasible because of the extreme visual complexity and variability of TikTok videos.</p>
        <p>Given the highly interactive nature of this SM and its
unprecedented success, we believe that TikTok
constitutes an extremely interesting source of linguistic and
non-linguistic data that could be used to analyze other
complex social and psychological phenomena and we
hope that this work paves the way for further research
in this direction.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>CRediT authorship contribution statement</title>
      <sec id="sec-3-1">
        <p>MD: Conceptualization, Methodology, Software, Data Curation (i.e., download, automatic transcription, annotation), Writing (§2, 3, 4)</p>
        <p>LP: Data Curation (i.e., manual transcription)</p>
        <p>PV: Conceptualization, Data Curation (i.e., download), Writing (§1)</p>
        <p>GG: Supervision, Funding acquisition.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Funding</title>
      <sec id="sec-4-1">
        <p>This work was partially funded by the RaAM Association (project “How about metaphors for dinner? A digest of metaphorical conceptualizations in pro-Ana communities”) and the University of Bologna (AlmaIdea 2022 “MetaphAN” project).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          (
          <year>2020</year>
          )
          <fpage>1219</fpage>
          -
          <lpage>1223</lpage>
          . [26]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Herrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hallward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Duncan</surname>
          </string-name>
          , “this is
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>of eating disorders 54</source>
          (
          <year>2021</year>
          )
          <fpage>516</fpage>
          -
          <lpage>526</lpage>
          . [27]
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Garcìa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Dìez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. M.</given-names>
            <surname>Sànchez</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>ana and pro-mia resource</article-title>
          ,
          <source>European Psychiatry 64</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <fpage>S703</fpage>
          -
          <lpage>S703</lpage>
          . [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>González-Nuevo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cuesta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Muñiz</surname>
          </string-name>
          , Concern
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>berspace 15</source>
          (
          <year>2021</year>
          ). [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minadeo</surname>
          </string-name>
          , L. Pope, Weight-normative messag-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>analysis</surname>
          </string-name>
          ,
          <source>Plos one 17</source>
          (
          <year>2022</year>
          )
          <article-title>e0267997</article-title>
          . [30]
          <string-name>
            <given-names>M.</given-names>
            <surname>Donati</surname>
          </string-name>
          , C. Strapparava, CorEDs: A cor-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>within the 13th Language Resources</article-title>
          and Evaluation
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>ciation</surname>
          </string-name>
          , Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>85</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          https://aclanthology.org/
          <year>2022</year>
          .rapid-
          <volume>1</volume>
          .
          <fpage>10</fpage>
          . [31]
          <string-name>
            <given-names>V.</given-names>
            <surname>Richichi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chinello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Parma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Zappa</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>clinica dello sviluppo 22</source>
          (
          <year>2018</year>
          )
          <fpage>499</fpage>
          -
          <lpage>514</lpage>
          . [32]
          <string-name>
            <given-names>N. L.</given-names>
            <surname>Bragazzi</surname>
          </string-name>
          , G. Prasso, T. S. Re, R. Zerbetto,
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Risk management and healthcare policy (</article-title>
          <year>2019</year>
          )
          <fpage>145</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          151. [33]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gagliardi</surname>
          </string-name>
          , “
          <article-title>odio tutto ciò, voglio le ossa”: Una</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>iano LinguaDue 13</source>
          (
          <year>2021</year>
          )
          <fpage>520</fpage>
          -
          <lpage>536</lpage>
          . [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sherman</surname>
          </string-name>
          ,
          <article-title>Tiktok reveals detailed user numbers</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>for the first time</article-title>
          ,
          <source>Retrieved October</source>
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <year>2020</year>
          . [35]
          <year>2023</year>
          . URL: https://www.tiktok.com/legal/page/eea/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>privacy-policy/en</article-title>
          . [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , M. Pezeshki,
          <string-name>
            <given-names>P.</given-names>
            <surname>Brakel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. L. Y.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>works</surname>
          </string-name>
          ,
          <source>arXiv preprint arXiv:1701.02720</source>
          (
          <year>2017</year>
          ). [37]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ooms</surname>
          </string-name>
          , tesseract: Open Source OCR Engine,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>2023. https://docs.ropensci.org/tesseract/ (website)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>