<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Measuring the Severity of the Signs of Eating Disorders Using Machine Learning Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sachin Prasanna</string-name>
          <email>sachinprasanna.211it058@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abhayjit Singh Gulati</string-name>
          <email>abhayjitsinghgulati.211ee102@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subhojit Karmakar</string-name>
          <email>subhojit.211it071@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M Yoga Hiranmayi</string-name>
          <email>myogahiranmayi.211it038@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar Madasamy</string-name>
          <email>m_anandkumar@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Technology Karnataka</institution>
          ,
          <addr-line>Surathkal, Mangaluru, 575025</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents the results submitted by Team SCaLAR-NITK for task 3 of eRisk Lab at CLEF 2024 [1]. The dataset provided by the task organizers consisted of 74 subjects for training and 18 for testing. We begin by describing the data cleaning and preprocessing steps. Subsequently, we outline various approaches used to address the problem, such as Word2Vec, TF-IDF, Backtranslation and Dimensionality Reduction, among others. Finally, we summarize the results obtained from each approach. Our solutions demonstrated strong performance, achieving the best results in 7 out of the 8 evaluated metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Backtranslation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Eating disorders, such as anorexia nervosa, bulimia nervosa, and binge eating disorder, are serious
mental health conditions characterized by abnormal eating habits and distorted body image [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These
disorders often stem from a combination of genetic, psychological, and social factors. Individuals may
restrict food intake, engage in binge eating followed by purging behaviors, or compulsively overeat.
Eating disorders can have devastating effects on physical health, leading to malnutrition, electrolyte
imbalances, and organ damage. Psychological impacts include depression, anxiety, and low self-esteem.
Treatment typically involves a combination of therapy, nutritional counseling, and medical supervision
to address both physical and mental aspects of the disorder.
      </p>
      <p>In this paper, we discuss methodologies for detecting signs of eating disorders from users’ Reddit
posts. The first approach involves creating separate models for each of the 22 questions, with each
model learning the distribution specific to its corresponding question. The second approach reshapes
the dataset to include the text of the question in a separate column, allowing the use of a single model for
both training and predicting, rather than maintaining 22 individual models. The third approach employs
Principal Component Analysis (PCA) for dimensionality reduction, ensuring that the importance of the
text related to the question is weighted at a ratio of 3:1. These methodologies aim to effectively identify
patterns and indicators of eating disorders in users’ Reddit posts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>The dataset was provided as XML files, which had to be cleaned before further processing. The
XML files contained the usernames, the posts each user had written and the corresponding timestamps. The
answers given by the subjects (the true labels) were provided in a separate text file.</p>
      <sec id="sec-2-1">
        <title>2.1. Data Cleaning</title>
        <p>Several XML files had loading issues and were incorrectly formatted. So, as a first step, these issues were
resolved and the XML file for each user was converted to a CSV file for further cleaning and use. It
was also found that some column names were mismatched; these were corrected.</p>
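        <p>As a rough illustration of this conversion, the sketch below parses one subject's XML file with Python's
standard library and writes the posts to a CSV file. The element names (WRITING, TITLE, DATE, TEXT) and the
column names are assumptions made for illustration, since the exact schema is not reproduced here.</p>
        <preformat>
# Illustrative XML-to-CSV conversion; element and column names are assumed.
import xml.etree.ElementTree as ET
import pandas as pd

def xml_to_csv(xml_path: str, csv_path: str) -> None:
    tree = ET.parse(xml_path)
    rows = []
    for post in tree.getroot().iter("WRITING"):   # assumed post element name
        rows.append({
            "title": post.findtext("TITLE", default=""),
            "date": post.findtext("DATE", default=""),
            "text": post.findtext("TEXT", default=""),
        })
    pd.DataFrame(rows).to_csv(csv_path, index=False)
        </preformat>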
        <p>Further cleaning involved the removal of emojis from the users' posts using the emoji library. It
was also found that several posts were wrapped in byte-string markers, starting with b" and ending with ". These were
removed accordingly. Escaped Unicode sequences were replaced with the characters they represent. For
example, \xe2\x80\x99 denotes the right single quotation mark (’), and instances such as these were found and replaced.</p>
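        <p>A minimal sketch of these cleaning steps is given below, assuming a recent version of the emoji package
(which provides replace_emoji); the function name and the exact escape-decoding strategy are illustrative.</p>
        <preformat>
# Illustrative cleaning of a single post: byte-string markers, escaped UTF-8 sequences, emojis.
import emoji

def clean_post(text: str) -> str:
    # Strip byte-string markers such as b"..." left over from the export
    if text.startswith('b"') and text.endswith('"'):
        text = text[2:-1]
    # Decode escaped UTF-8 sequences, e.g. \xe2\x80\x99 becomes a right single quote
    try:
        text = (text.encode("latin-1").decode("unicode_escape")
                    .encode("latin-1").decode("utf-8"))
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
    # Remove emojis using the emoji library
    return emoji.replace_emoji(text, replace="")
        </preformat>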
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Preprocessing</title>
        <p>After cleaning, all posts from a single user were concatenated into one single chunk and preprocessed
using standard preprocessing methods. This text was lowercased and URLs were removed. All kinds
of punctuation were removed. Stop words were removed so that the machine learning models are
not heavily influenced by the effects of these words. Finally, the words were lemmatized using the
WordNetLemmatizer function from the nltk.stem library.</p>
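        <p>A sketch of this preprocessing pipeline is shown below; it assumes the relevant NLTK resources
(punkt, stopwords, wordnet) have already been downloaded, and the helper name is illustrative.</p>
        <preformat>
# Illustrative preprocessing: lowercase, strip URLs and punctuation, remove stop words, lemmatize.
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)                      # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = [t for t in word_tokenize(text) if t not in stop_words]   # remove stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
        </preformat>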
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Three different approaches are proposed in our solution to the problem. They are described and
explained below.</p>
      <sec id="sec-3-1">
        <title>3.1. Different Models for Each Question</title>
        <p>
          This approach consisted of fitting a model to each question. Since there were 22 questions answered
by each subject, we made 22 different models to learn the distribution of each question’s answers.
Thereafter, the Linear Support Vector Machine algorithm was used for classification [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The Linear Support Vector Machine model implemented here utilizes stochastic gradient descent
with a hinge loss function for maximum-margin classification. It applies L2 regularization to prevent
overfitting and exposes hyperparameters for the regularization strength, reproducibility (random seed), the maximum
number of iterations and the stopping tolerance. It is particularly suited to large datasets
and online learning scenarios.</p>
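        <p>A minimal sketch of such a classifier in scikit-learn is given below; the hyperparameter values shown are
illustrative rather than the tuned ones used in our runs.</p>
        <preformat>
# Illustrative per-question classifier: linear SVM trained with SGD (hinge loss, L2 penalty).
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(
    loss="hinge",        # maximum-margin (linear SVM) objective
    penalty="l2",        # L2 regularization against overfitting
    alpha=1e-4,          # regularization strength
    max_iter=1000,       # maximum number of iterations
    tol=1e-3,            # stopping tolerance
    random_state=42,     # reproducibility
)
        </preformat>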
        <p>To construct input representations for the models, a pipeline was built. The first component was
the CountVectorizer, which converts the collection of text documents into a matrix of token counts. Then,
the standard TF-IDF weighting is applied and the corresponding matrix is constructed. These representations
are computed for each document and fed into the machine learning algorithms and the neural network for each
question.</p>
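        <p>The sketch below outlines one such pipeline per question, combining token counts, TF-IDF weighting and
the linear classifier from the previous sketch; the training containers (train_texts, train_labels) are hypothetical.</p>
        <preformat>
# Illustrative setup: one CountVectorizer + TF-IDF + linear SVM pipeline per question.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

pipelines = {
    q: Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(loss="hinge", penalty="l2", random_state=42)),
    ])
    for q in range(22)
}
# for q in range(22):
#     pipelines[q].fit(train_texts[q], train_labels[q])   # one fit per question
        </preformat>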
        <p>This approach demonstrated good results, achieving the best score on 2 of the evaluation metrics. We
chose the simpler approach because complex models would overfit the data and hence result in poorer scores,
as observed in the results of the same task last year. The approach clearly paid off with good results.</p>
        <p>Although this approach produced decent results, the questions themselves were not given as
inputs to the models. It was simply a case of learning 22 arbitrary distributions and then predicting from
them. This emerged as a drawback, since the questions carried no weight in the predictions.
Also, since the training set was small, there was a chance that the training was not sufficient for
predicting the validation labels correctly.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Extending Dataset and using Questions along with Word2Vec and</title>
      </sec>
      <sec id="sec-3-3">
        <title>Backtranslation</title>
        <p>To overcome the drawback of the previous approach, the dataset was reshaped to include the text
of the question in a separate column. This also increases the size of the dataset and allows the use of a
single model for training and predicting, rather than the 22 models of the previous approach. A pictorial
representation of the transformation is shown in Figure 1.</p>
        <p>
          After the transformation, word embeddings for both the text and the question were constructed using
Word2Vec. Pretrained Word2Vec models offer precomputed word embeddings trained on vast amounts
of text data, such as the Google News articles used in our work. The loaded model, pre-trained on the
Google News corpus (about 3 billion running words) and containing 3 million 300-dimensional English
word vectors, captures intricate semantic nuances and gives numerical meaning to text [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Since each
word is represented by a 300-dimensional vector, the average vector over all words was taken as the final
representation of each subject. If a word was not present in the Word2Vec vocabulary, its vector was
taken as the zero vector.
        </p>
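        <p>The sketch below illustrates this averaging step with gensim; loading the pretrained vectors through
gensim's downloader ("word2vec-google-news-300") is an assumption, as they could equally be loaded from a
local binary file.</p>
        <preformat>
# Illustrative averaging of pretrained Google News vectors; OOV words map to the zero vector.
import numpy as np
import gensim.downloader as api

w2v = api.load("word2vec-google-news-300")   # 3 million 300-dimensional vectors

def embed(text: str) -> np.ndarray:
    vectors = [w2v[w] if w in w2v else np.zeros(300) for w in text.split()]
    return np.mean(vectors, axis=0) if vectors else np.zeros(300)
        </preformat>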
        <p>Separate word embeddings are generated for the text and the question. Each of these is a 300-dimensional
vector, and the two were concatenated to form the final embedding vector used for training, which is
therefore a 600-dimensional vector. In this way, we ensured that both the text and the question
are given equal weightage when predicting labels. An illustration of our methodology is shown in
Figure 2.</p>
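        <p>Using the embed() helper sketched above (with hypothetical post_text and question_text variables), the
concatenation step looks as follows.</p>
        <preformat>
# Illustrative concatenation of the 300-dimensional text and question embeddings.
import numpy as np

features = np.concatenate([embed(post_text), embed(question_text)])
assert features.shape == (600,)
        </preformat>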
        <p>
          The imbalance in the training data was addressed using Backtranslation. Backtranslation is the process
of translating text from one language to another and then translating it back to the original language
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We translated our text from English to French and then back to English. This technique is commonly used in
natural language processing for data augmentation and for improving the robustness of machine learning
models. An illustration of Backtranslation is shown in Figure 3. The split of labels is represented
pictorially in Figure 4.
        </p>
        <p>Our implementation handles cases where the input text exceeds the character limit of the
Google Translate API (4,999 characters) by splitting the text into smaller chunks. It utilizes the Google Translator
library to perform the translation tasks efficiently.</p>
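        <p>A sketch of this chunked backtranslation is shown below; it uses the deep-translator package's
GoogleTranslator as a stand-in for the translation library, and the chunk size mirrors the 4,999-character limit.</p>
        <preformat>
# Illustrative backtranslation (English -> French -> English) with chunking.
from deep_translator import GoogleTranslator

def backtranslate(text: str, chunk_size: int = 4999) -> str:
    to_fr = GoogleTranslator(source="en", target="fr")
    to_en = GoogleTranslator(source="fr", target="en")
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    french = "".join(to_fr.translate(c) for c in chunks)
    fr_chunks = [french[i:i + chunk_size] for i in range(0, len(french), chunk_size)]
    return "".join(to_en.translate(c) for c in fr_chunks)
        </preformat>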
        <p>Backtranslation was performed only for the text column and only for those instances with labels
2, 3, 4 and 5. The backtranslated texts were then added to the dataset as new instances, which increased the weightage
of the minority classes. This can be seen pictorially in Figure 5.</p>
        <p>Using these new embeddings, Linear Support Vector Machine and Gradient Boosting algorithms
were trained on the data.</p>
        <p>
          The Gradient Boosting model utilized in this implementation is a powerful ensemble learning
method widely used for classification tasks. It works by sequentially adding weak learners, typically
decision trees, to correct the errors made by preceding models. Each subsequent learner focuses on
the residual errors of the previous model, gradually improving the overall predictive performance [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
The key hyperparameters include the number of estimators and the learning rate, which control the
model’s complexity and the rate at which each additional learner contributes to the ensemble.
        </p>
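        <p>A minimal sketch of the Gradient Boosting classifier with these hyperparameters is given below; the
values shown are illustrative rather than the tuned ones.</p>
        <preformat>
# Illustrative Gradient Boosting classifier (scikit-learn).
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,    # number of weak learners (decision trees)
    learning_rate=0.1,   # contribution of each additional learner
    random_state=42,
)
# gb.fit(X_train, y_train)   # X_train: the 600-dimensional embedding vectors
        </preformat>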
      </sec>
      <sec id="sec-3-4">
        <title>3.3. Using Word2Vec with Backtranslation and Dimensionality Reduction</title>
        <p>An innovative method to reduce the weightage of the question embeddings is proposed, based on
dimensionality reduction. Principal Component Analysis (PCA) was used for this purpose.</p>
        <p>
          PCA (Principal Component Analysis) is a dimensionality reduction technique that transforms
high-dimensional data into a lower-dimensional representation while preserving most of its variance. It
achieves this by identifying the principal components, which are orthogonal directions in the original
feature space that capture the maximum variance in the data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. It was used to reduce the dimensionality
of the question embeddings from 300 to 100.
        </p>
        <p>This ensured that the importance of the text relative to the question is in a ratio of 3:1 (300 text
dimensions to 100 question dimensions). The approach is illustrated in Figure 6.</p>
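        <p>The sketch below shows this reduction and the subsequent concatenation; the embedding matrices
(text_embeddings, question_embeddings) are hypothetical stand-ins for the vectors computed earlier.</p>
        <preformat>
# Illustrative PCA reduction of the question embeddings from 300 to 100 dimensions,
# giving a 3:1 text-to-question weighting in the concatenated feature vector.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=100)
question_reduced = pca.fit_transform(question_embeddings)   # shape: (n_samples, 100)
features = np.hstack([text_embeddings, question_reduced])   # shape: (n_samples, 400)
        </preformat>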
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The results of our endeavours are shown in Table 1. For this task, 5 teams took part and 14 different
solutions or runs were submitted, along with 3 baseline solutions described by the authors. Our solutions
performed exceedingly well, and we had the best results in 7 out of the 8 given metrics.</p>
      <p>Judging from the previous year’s results, complex deep learning based solutions like transformers did not give
great results, most likely because of overfitting. Hence we decided to tackle the problem with a simpler
machine learning based approach. It can be inferred that when the size of the
dataset is small, simpler solutions offer better results than complex deep learning based solutions.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper outlines Team SCaLAR-NITK’s involvement in Task 3 of the eRisk@CLEF 2024 edition,
where we investigated diverse techniques for assessing eating disorders across multiple users based
on their Reddit contributions. Our investigation focused on two primary approaches to tackle the
problem. The first approach employed 22 distinct models, each tailored to a specific question, while the
second approach utilized a single model capable of capturing the essence of both the questions and the
posts. Backtranslation was used to balance the label distribution, and
Principal Component Analysis (PCA) was used to better assign the weightages of text and question. Our
contributions aim to address complex societal challenges in mental health detection and intervention.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of eRisk 2024:
          <article-title>Early risk prediction on the internet</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th International Conference of the CLEF Association, CLEF 2024</source>
          , Springer International, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fuller-Tyszkiewicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Jarman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abuhassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sukunesan</surname>
          </string-name>
          , Edbase:
          <article-title>Generating a lexicon base for eating disorders via social media</article-title>
          ,
          <source>IEEE Journal of Biomedical and Health Informatics</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Swetapadma</surname>
          </string-name>
          ,
          <article-title>A study on support vector machine based linear and non-linear pattern classification</article-title>
          ,
          <source>in: 2019 International Conference on Intelligent Sustainable Systems (ICISS)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kechadi</surname>
          </string-name>
          ,
          <article-title>Outlier detection in health record free-text using deep learning</article-title>
          ,
          <source>in: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Data augmentation via back-translation for aspect term extraction</article-title>
          ,
          <source>in: 2023 International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>Gradient boosting based classification of ion channels</article-title>
          ,
          <source>in: 2021 International Conference on Computing, Communication, and Intelligent Systems (ICCCIS)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sehgal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bhasker</surname>
          </string-name>
          , Shantanu,
          <article-title>Data analysis using principal component analysis</article-title>
          ,
          <source>in: 2014 International Conference on Medical Imaging, m-Health and Emerging Communication Systems (MedCom)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>