Building a corpus on Eating Disorders from TikTok: challenges and opportunities

Melissa Donati

Ludovica Polidori

Paola Vernillo

Gloria Gagliardi

Alma Mater Studiorum - University of Bologna, Italy

We present two synchronic corpora of Eating Disorders (ED) related discourse on Social Media. PAC (i.e., ProAna/Anorexia Corpus) and RAC (i.e., Recovery from Ana/Anorexia Corpus) resources focus on the contents posted on TikTok, respectively, by communities promoting anorectic behavior and users sharing experiences concerning the process of recovery from their ED. We report on the corpus statistics and creation process, focusing specifically on the methodological issues raised by this novel Social Media platform.

Keywords: Eating Disorders, Corpus Linguistics, TikTok

1. Introduction

It was only 20 years ago that one of the darkest sides of Eating Disorders (ED) was revealed through the proliferation of websites, blogs, and social networks, in which a growing number of adolescents and young adults started sharing information about their eating experiences with like-minded users. Among these pro-ED communities, researchers and clinicians showed particular concern for pro-Ana (i.e., “pro-anorexia”) groups, i.e., web-based communities of anorexic (or aspiring anorexic) individuals engaged in the promotion of their Eating Disorder [1]. Interestingly, one of the most horrific and dangerous aspects of pro-Ana groups is that Anorexia Nervosa (AN) is not presented as a psychiatric disorder associated with pathological body image dissatisfaction [2], but rather as a way of living with its own rules and rituals to be respected.

While over the last years much has been done to prevent the circulation of pro-ED content on social media (e.g., TikTok’s adoption of measures to obscure harmful contents: [3]), a new but specular phenomenon has recently taken hold, that is, the spread of pro-recovery accounts of individuals who are in the process of healing from an ED and are willing to share their eating experience to help other online users [4]. From a linguistic perspective, research on ED has been very limited and became an object of study only in recent years [5, 6, 7, 8, 9, 10], as opposed to other psychopathologies, such as schizophrenia [11, 12], personality disorder [13], and depression [14, 15, 16, 17]. This already problematic picture has been further compromised by the inhomogeneous representation of linguistic data in the literature, where the majority of studies have been dedicated to the linguistic profiling of ED-affected individuals in a Germanic language (English, German, Norwegian) [18]. This paper represents a small step towards the reversal of this tendency, but a crucial part of two larger projects (Metaphan^1 and RaAM project 2022^2) aiming at identifying, through different NLP techniques and tools, potential lexical and semantic patterns in anorectic individuals. To this end, in the current research, we present the process of collecting data (i.e., oral and written productions) from ED communities on TikTok, currently the most widely used social media among young people and adolescents, namely the population groups at greater risk for EDs. In the following paragraphs, we give a brief overview of the literature on the topic (Section 2), then we describe the process of creating the corpus and discuss the methodological issues that were met (Section 3), and to conclude we provide a few insights for future work (Section 4).

^1 https://site.unibo.it/metaphan/en

^2 https://site.unibo.it/metaphan/en/connected-research-activities

2. Related Works

In recent years, we have witnessed exponential growth in the use of Social Media (SM), especially by adolescents and young people. The community-building nature and the interactive dynamics of these platforms, as well as the less direct way of communicating, encourage users to openly discuss a wide variety of topics [19]. In turn, this makes available a huge amount of data that can be used for different purposes (e.g., extracting actionable patterns, forming conclusions about users, conducting research, etc.).

For this reason, Social Media Mining (SMM), i.e., the process of extracting big data from SM, now constitutes a well-established methodology to collect large samples of data in different research areas [20]. This approach has proved particularly fruitful for collecting data on EDs, as people suffering from these disorders seem to overcome the self-protective nature of their ED to engage in ED-related discourse with online users sharing similar experiences [21]. Indeed, in the last decade, many studies have used different SM platforms as a source of data to analyze EDs [22, 23, 24, 25, 21, 26, 27, 28, 29]. However, the state of the art on ED discourse on SM currently presents two main limitations: i) the majority of the analyses were carried out on small datasets built ad hoc for the purpose of the work (with the only exception of [30]), and ii) they mostly focused on the English language. As a matter of fact, in the Italian framework there has been very little research on the representation of EDs on SM, and that little was mostly focused on Anorexia Nervosa and did not target EDs in general [31, 32, 33].

3. Corpus Creation: Methodological Issues

Against this background and intending to fill this gap, we created a collection of English and Italian ED-related data that could be used for different types of research (from purely linguistic and content analyses that could help pinpoint the features and characteristics of ED-related discourse, to various computational techniques that could be used to implement systems for the automatic detection of ED-related content on SM). We selected TikTok as a source of data as it currently represents the most widely used SM, especially among young people and adolescents, namely the at-risk population for EDs [34].

To achieve this goal, we first needed to define the nature and characteristics of the corpus itself. As far as the linguistic features are concerned, our corpus is specialized (i.e., it is focused on the topic of ED discourse on TikTok), synchronic (i.e., it refers to a specific point in time, namely the moment the data were downloaded), and targets both written and spoken language (TikTok videos contain spoken and/or written text). We did not set a priori a target dimension to be reached, because this feature is totally dependent upon the possibility of extracting the data automatically (Section 3.1). Conversely, following the common practice in the domain of SMM, we assumed that ‘there is no data like more data’ and intended to download as many videos as possible. To maximize the representativeness of the corpus, we tried to balance the sample with respect to the types of videos being collected, but we could not do so concerning the users’ gender, because for both corpora the vast majority of profiles belonged to female individuals (see Section 3.3 for more details). The target population consisted of those profiles that identify themselves in one of the two following categories: i) supporters of anorectic behaviors (for the English PAC corpus); ii) witnesses and motivators for the recovery process (for the Italian RAC corpus). Such profiles were identified based on the linguistic and non-linguistic (i.e., emojis) information present in their profile bio. The selection criteria will be presented in Section 3.1, prior to the description of the data collection process and the discussion of the related issues that were encountered.

Before getting further into the methodology, it is necessary to make an ethical consideration concerning the collection of data from SM. Broadly speaking, SM posts that are publicly accessible are treated as belonging to the public domain; therefore, according to common practice, consent from the creators is not deemed necessary to download such data. This is strengthened by the fact that, upon registration, TikTok asks its users to consent to a set of terms of service that make the data available for access to third parties [35]. In addition, when creating and managing their accounts and contents, users can decide to make them publicly accessible or private (i.e., only viewable by accepted followers); at any time, they can also restrict access to some of their contents through privacy settings and choose whether to make them downloadable. For the above reasons, given that for the purpose of this work only public and downloadable data were analysed, we did not seek users’ consent to collect the posts. In line with similar SM analyses [26], no reference to any identifying information, such as usernames, will be made.

3.1. Data Collection

As explained above, the selection criteria adopted to identify the target profiles were based on the information present in the profiles’ bio. However, to track the target profiles, we needed to start from a list of ED-related hashtags that could lead us to such profiles via a keyword-based search. The hashtags used herein were generated both by brainstorming and by exploring the platform for a couple of weeks, noting down the most popular trends and the most widely used hashtags (see Table 1 for an overview). Following this hashtag-driven search, we noticed that there was very little, if any, pro-Ana content produced in Italian; for this type of ED-related content we therefore decided to collect a small sample of English data. On the other hand, we found quite a few profiles representing the ED-recovery community.

Among these profiles, we selected those having at least 10k followers (some of them exceed 2M followers) and at least 10 ED-related posts, so that we could maximize the chance of gathering interesting and relevant linguistic information. We then used the ED-related hashtags to conduct a within-profile search, selecting only the ED-related videos in each profile in order to extract them.
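The selection thresholds above (at least 10k followers, at least 10 ED-related posts) can be illustrated with a minimal Python sketch. It is not the procedure used for the corpus, where profiles were inspected manually: the `Profile` record, its field names, and the helper functions are hypothetical and only show how the two criteria combine.

```python
from dataclasses import dataclass
from typing import List

# Thresholds stated in the text.
MIN_FOLLOWERS = 10_000
MIN_ED_POSTS = 10


@dataclass
class Profile:
    """Hypothetical record for a manually inspected profile (not a TikTok API object)."""
    handle: str                      # placeholder identifier, never a real username
    followers: int
    post_hashtags: List[List[str]]   # one list of hashtags per post


def count_ed_posts(profile: Profile, ed_hashtags: List[str]) -> int:
    """Count the posts that carry at least one hashtag from the ED-related list."""
    wanted = {h.lower() for h in ed_hashtags}
    return sum(
        1 for tags in profile.post_hashtags
        if wanted & {t.lower() for t in tags}
    )


def is_eligible(profile: Profile, ed_hashtags: List[str]) -> bool:
    """Apply both selection criteria: audience size and number of ED-related posts."""
    return (
        profile.followers >= MIN_FOLLOWERS
        and count_ed_posts(profile, ed_hashtags) >= MIN_ED_POSTS
    )
```

Under these assumptions, a profile with 12,000 followers and ten posts tagged `#dca` would be kept, while one with 9,000 followers would be discarded regardless of how many ED-related posts it contains.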

At this point, the next step consisted of extracting the identified ED-related videos from the selected profiles. For the sake of time and efficiency, we wanted to download the data automatically. However, differently from other popular SM, TikTok has not yet released any official API that can be used by researchers and developers to automate the process of accessing and extracting the data. In addition, even if unofficial APIs exist, they get outdated almost immediately after their release because TikTok is constantly updating the anti-bot system preventing automatic access from the same IP. To get around this, we looked for a reliable and cost-effective proxy provider for TikTok scraping, but we could not find any viable solution.

Therefore we decided to proceed with the manual downloading of the data. The main drawback of this way of proceeding is that, due to time and resource constraints, we could not collect a very large number of videos (see Table 2). On the bright side, however, the manual downloading allowed us to i) enhance the content filtering process and ii) notice that TikTok videos have different formatting styles that might be worth distinguishing, not only to ease the ensuing transcription process but also to conduct separate content analyses and compare the different results. Based on our observations about the different formatting styles, we grouped the TikTok videos into 4 subcorpora: 1) Speech-only videos: in which the user is talking in the absence of background music and/or written text; 2) Playback: in which the user lip-syncs over a song or an extract from a movie or TV show; 3) Text-only: in which there is neither background music nor the users themselves speaking, but only written text superimposed on the video; and 4) Mixed: in which the above-mentioned features are present in various combinations.

Table 1. ED-related hashtags used for the search.

Pro-Ana hashtags           Pro-Recovery hashtags
(#ww3eiigghhttll0ossss)    (#ddccaarr3eccoovveerryy)
#unhealthyweightloss       #dca^4, #dcaitalia
(+ lexical variations)     #fiocchettolilla^5
#kpop^3                    #dcafighting

^3 K-pop (for Korean-pop) is a popular genre of music originating from South Korea that has been hugely influential in the ‘diet scene’ because young people want to look like their favourite K-pop stars, who are known for their extreme diets; indeed, many young artists have left behind the K-pop world in order to focus on eating disorder treatment.

^4 Disturbo del Comportamento Alimentare (Eating Disorder).

^5 The Lilac Ribbon is the official international symbol against Eating Disorders.

Organizing the videos into 4 categories was particularly useful for the transcription phase, as it allowed us to adopt different strategies and techniques based on the input characteristics. As with the downloading phase, although we intended to automate the transcription process as much as possible, the high complexity of the data has, in some cases, made human intervention necessary.

For speech-only and playback videos, automatic transcription was performed using the Google Web Speech API, which is easily accessible through the SpeechRecognition Library [36]. To assess the quality of the automatic transcription, a random sample of videos (n=10) for each category was extracted, transcribed manually, and then compared with the machine-based transcription. For speech-only videos, a high agreement score was obtained between human and machine transcription (>90%), which confirmed the viability of the method adopted. Conversely, playback videos emerged as more problematic: manual correction was needed because both the singing and the musical accompaniment adversely affected intelligibility.

Automatic transcription was also attempted for text-only videos by means of Optical Character Recognition (OCR) using the Tesseract OCR engine [37], but we obtained poor results due to the high visual complexity of the input data, more specifically to the extreme variability of font type, size, and color, the lack of adequate contrast with the background, the non-hierarchical spatial organization of texts, and the presence of non-textual graphical elements (e.g., lexical variations of words, where letters are substituted by numbers or emojis to prevent the platform’s censorship and filtering system from blocking the content as potentially harmful, e.g., ‘starving’ written by replacing ‘star’ with the corresponding emoji, or ‘disorder’ written as ‘d1s0rder’). The same issue, boosted to the maximum, was observed with mixed videos, where speech, music, and written text were mingled. Therefore, for these two categories of videos, we could only perform the transcription manually.

We report below, as an example of the type of ED-related content that was selected, the transcription of two videos, one for each of the two datasets.

[from RAC]

"questo video è davvero davvero difficile da registrare per me ma lo faccio perché voglio condividere tutta la mia vita con voi e voglio aiutare delle persone che si trovano nella mia stessa situazione parlando del mio problema dovete sapere che io sono stata prima anoressica sono arrivata a pesare 36 kg e vi parlerò poi te la causa scatenante poi riscoperto il cibo ho iniziato ad abbuffarmi in una maniera assurda a sentirmi in colpa e quindi poi a vomitare questa si chiama bulimia ovviamente alternavo momenti digiuno quindi magari non mangio proprio per giorni a momenti in cui il tuo corpo ha bisogno di cibo e quindi ti abbuffi e mangi qualsiasi cosa volevo solo dirvi che ieri è successa un’altra volta il fatto è che io me lo vedo subito in faccia cioè mi vedo 10 volte più grossa e mi sento davvero super gonfia che senti ma sono riuscita a non vomitare perché io sono più forte sono con tutte voi"^6

[from PAC]

"i’m *** i’m a new member stats starter weight 140.1 ibs goal weight 100 ibs ultimate goal weight 90 ibs for now i binge eat when i’m bored so i gained a lot of weight in the past months i’m trying to limit myself on eating i am currently 4’10 and i’m overweight for my height age i listen to subliminal and trying to workout also i hate exercising but i realized it is healthy for me and my body 33"
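The comparison between manual and automatic transcripts described above can be sketched as follows. The text does not state which agreement metric was used, so this example assumes a word-level score defined as one minus the word error rate (edit distance over the number of reference words); the function names are illustrative only.

```python
from typing import List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def agreement(manual: str, automatic: str) -> float:
    """Assumed metric: 1 - WER, computed on lowercased whitespace tokens, floored at 0."""
    ref = manual.lower().split()
    hyp = automatic.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


def mean_agreement(pairs: List[Tuple[str, str]]) -> float:
    """Average agreement over a sample of (manual, automatic) transcript pairs."""
    return sum(agreement(m, a) for m, a in pairs) / len(pairs)
```

With this definition, a ten-word manual transcript in which the recognizer gets one word wrong scores 0.9. Averaging over the ten sampled videos of a category gives a single figure that can be compared against a chosen acceptance level.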

3.3. Corpus Statistics

In Table 2, we report an overview of the statistics for the two corpora in terms of number of videos, number of words, and number of users from whose profiles the data were extracted.
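As a rough illustration of how such counts could be derived from the transcribed material, the sketch below computes the three figures from a list of transcripts. The `Video` record and the whitespace-based word count are assumptions made for the example, not a description of how the reported statistics were obtained.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Video:
    """Hypothetical record for one transcribed video."""
    user_id: str     # anonymised identifier, never a real username
    transcript: str


def corpus_stats(videos: List[Video]) -> Dict[str, int]:
    """Return the number of videos, whitespace-delimited words, and distinct users."""
    return {
        "videos": len(videos),
        "words": sum(len(v.transcript.split()) for v in videos),
        "users": len({v.user_id for v in videos}),
    }
```

A different tokenizer would change the word count, so the tokenization choice should be reported alongside any figures computed this way.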

The two corpora are registered in CLARIN^7, but are not publicly accessible for the moment.

^6 [our translation] "making this video is really really hard for me but I am doing it because I want to share everything about my life with you and I want to help those who are experiencing the same situation by talking about my problem you must know that I have suffered first from anorexia I ended up weighing 36 kg and I will tell you about the trigger then I rediscovered food and started insanely binging and feeling guilty and then as a consequence throwing up this is called bulimia obviously I alternated periods of fasting so perhaps I would not eat for days with periods in which my body needed food and I would eat anything and I just wanted to tell you that yesterday it happened again and the thing is that I see it immediately on my face that is I see myself 10 times bigger and I feel really extremely bloated that you know but I managed not to throw up because I am stronger I am with you all"

^7 http://hdl.handle.net/20.500.11752/OPEN-997

4. Conclusion and Future Works

The aim of this work was twofold: on the one hand, we wanted to present two corpora on EDs, the English pro-Ana corpus (PAC) and the Italian pro-Recovery corpus (RAC), both built by extracting data from the popular SM TikTok; on the other, we wanted to discuss some methodological issues related to building a corpus using this platform as a source of data. More specifically, we pointed out that the absence of an official API does not allow the automatic extraction of the videos and requires manual work, which is highly time-consuming and makes it impossible to collect a very large sample of data. This, in turn, might impede the application of more complex computational analyses and limit the generalizability of the results. In addition, we raised the issue of transcribing the videos to text. In this case, implementing automatic approaches is not always feasible because of the extreme visual complexity and variability of TikTok videos.

Given the highly interactive nature of this SM and its unprecedented success, we believe that TikTok constitutes an extremely interesting source of linguistic and non-linguistic data that could be used to analyze other complex social and psychological phenomena and we hope that this work paves the way for further research in this direction.

CRediT authorship contribution statement

MD: Conceptualization, Methodology, Software, Data Curation (i.e., download, automatic transcription, annotation), Writing (§2, 3, 4); LP: Data Curation (i.e., manual transcription); PV: Conceptualization, Data Curation (i.e., download), Writing (§1); GG: Supervision, Funding acquisition.

Funding

This work was partially funded by the RaAM Association (project “How about metaphors for dinner? A digest of metaphorical conceptualizations in pro-Ana communities”) and the University of Bologna (AlmaIdea 2022 “MetaphAN” project).

[...] (2020) 1219-1223.

[26] S. S. Herrick, Hallward, L. R. Duncan, “this is [...] of eating disorders 54 (2021) 516-526.

[27] G. L. Jordan, M. D. Garcìa, B. L. Dìez, P. M. Sànchez, [...] ana and pro-mia resource, European Psychiatry 64 (2021) S703-S703.

[28] González-Nuevo, Cuesta, Muñiz, Concern [...] berspace 15 (2021).

[29] Minadeo, L. Pope, Weight-normative messag- [...] analysis, Plos one 17 (2022) e0267997.

[30] Donati, C. Strapparava, CorEDs: A cor- [...] within the 13th Language Resources and Evaluation [...] ciation, Marseille, France, 2022, pp. 80-85. URL: https://aclanthology.org/2022.rapid-1.10.

[31] Richichi, Chinello, Parma, L. E. Zappa, [...] clinica dello sviluppo 22 (2018) 499-514.

[32] N. L. Bragazzi, G. Prasso, T. S. Re, R. Zerbetto, [...] Risk management and healthcare policy (2019) 145-151.

[33] Gagliardi, “odio tutto ciò, voglio le ossa”: Una [...] iano LinguaDue 13 (2021) 520-536.

[34] Sherman, Tiktok reveals detailed user numbers for the first time, Retrieved October 2 (2020) 2020.

[35] [...] 2023. URL: https://www.tiktok.com/legal/page/eea/privacy-policy/en.

[36] Zhang, M. Pezeshki, Brakel, Zhang, C. L. Y. [...] works, arXiv preprint arXiv:1701.02720 (2017).

[37] Ooms, tesseract: Open Source OCR Engine, 2023. https://docs.ropensci.org/tesseract/ (website).