<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF-UPV at eRisk 2023: Early detection of pathological gambling using SVM.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Molina</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xinhui Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-F. Hurtado</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ferran Pla</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Informatics School, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In these working notes, we detail the experimentation carried out by the ELiRF-VRAIN team at eRisk Task 2: Early Detection of Signs of Pathological Gambling. We tackled the task using a classic machine learning approach: Support Vector Machines. The only data used were those provided in the task. Several configurations were tested, including various kernels and text vectorization strategies, using a grid search approach. According to the preliminary results provided by the organizers of the task, the proposed system obtained the best scores in terms of Precision, F1, ERDE5, ERDE50, and latency-weighted F1.</p>
      </abstract>
      <kwd-group>
        <kwd>Support Vector Machine</kwd>
        <kwd>Pathological Gambling Detection</kwd>
        <kwd>Social Media Monitoring</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>One of our goals, in the context of mental disorders, is to expand the GUAITA monitoring system [1]. This would allow not only the monitoring
and analysis of the publications associated with a topic or event, but also the profiling of the users
of the social network. In this way, it could serve as a tool for the early detection of symptoms
associated with various mental disorders.</p>
      <p>The early detection of pathological gambling task was introduced at eRisk 2021 [3]. In that edition,
no training data was provided, so each system had to build its own training dataset. The UNSL
system [4] tested several machine learning approaches, including SVM with bag-of-words (BOW) features,
achieving the best performance in most of the decision-based performance metrics as well as in
the ranking-based performance metrics. It distinguished two system components: a CPI component
that predicts the risk probability, and a DMC component that implements a rule-based early alert
policy. In the next edition, the 2022 eRisk shared task, a similar approach with some changes to
the alert decision policies was proposed [5].</p>
      <p>At eRisk 2022, the NLP-IISERB team [6] tested different classifiers and feature engineering
techniques, including AdaBoost, Logistic Regression, Random Forest, and SVM classifiers, as well as
pre-trained neural network models. They achieved the best results with the Random
Forest model using entropy-based BOW features. They also concluded that classical models
outperformed deep learning-based models on this task. Other teams that tested classical models
were BioNLP [7] and ZHAW [8]. The BioNLP team evaluated the effect of balanced and unbalanced
datasets with the different models. The ZHAW system is based on the UNSL system of eRisk 2021,
with some modifications; in particular, they used GloVe for feature extraction. Although their
working notes show good system performance, they achieved a very low precision in the shared
task for different reasons.</p>
    </sec>
    <sec id="sec-2">
      <title>2. System description</title>
      <p>Once the task and its dataset were analyzed, we decided to tackle the task using a classic machine
learning approach: Support Vector Machines (SVM) [9]. The main reason for choosing this
approach was its ability to handle the size of the task samples. The total number of submissions
of a user can be very large: the average number of tokens per user in the training set
was 11,821.23 for the positive samples (gamblers) and 14,416.23 for the negative samples
(control users), as shown in Table 1.</p>
      <p>One of the disadvantages of current Large Language Models (LLM) based on the Transformer
architecture [10] is their limitation when handling long texts, which requires some strategy
to fragment the samples. We thought that, in this task, LLM performance could be reduced by
having to limit the size of the input texts, and therefore the history of writings
of the subject, which can lead to the loss of information needed to make a correct prediction.</p>
      <p>The SVM implementation provided by the libsvm library<sup>2</sup> was used; in particular, we used the
implementation included in the scikit-learn package: the sklearn.svm.SVC class.</p>
      <p>To select the best model, we performed an exhaustive grid search over specified parameters
of the SVM classifier. To do that, we divided the provided training set and reserved a part as a
validation set.
<sup>2</sup>LIBSVM: A Library for Support Vector Machines, http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf</p>
      <p>We defined four different configurations according to variations in the vectorization and
the preprocessing of the submissions, as we will explain below. We selected the best model
parameters for each configuration. To do that, we ran a ten-fold cross-validation for each
configuration and selected the parameters that maximized the average balanced accuracy.</p>
      <p>Concretely, the following parameters were tested during the tuning phase:
• The regularization parameter C: 0.001, 0.01, 0.1, 1, 10, and 100.
• Different kernels: ’rbf’, ’sigmoid’, ’linear’, and ’poly’.</p>
      <p>• Different degrees for the ’poly’ kernel: 2, 3, 4, and 5.</p>
      <p>Once the best models for each configuration were determined, we tested them on the
validation set and chose the one that provided the best F1 value.</p>
      <p>Finally, we trained a new model on the whole dataset provided for the task with the selected
parameters. We used this final model to participate in the competition.</p>
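      <p>The selection procedure described above can be sketched with scikit-learn's GridSearchCV. This is a minimal sketch under assumptions: the synthetic data stands in for the real TF-IDF matrix and labels, which are not reproduced here.</p>

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid over the parameters listed above: C, kernel, and the degree of the
# polynomial kernel; scored by balanced accuracy with ten-fold CV.
param_grid = [
    {"C": [0.001, 0.01, 0.1, 1, 10, 100],
     "kernel": ["rbf", "sigmoid", "linear"]},
    {"C": [0.001, 0.01, 0.1, 1, 10, 100],
     "kernel": ["poly"], "degree": [2, 3, 4, 5]},
]

# Synthetic stand-in for the vectorized training set (illustration only).
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

search = GridSearchCV(SVC(), param_grid, scoring="balanced_accuracy", cv=10)
search.fit(X, y)
print(search.best_params_)
```

      <p>After this search, the best estimator would be refit on the full dataset with search.best_params_, as described above.</p>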
    </sec>
    <sec id="sec-3">
      <title>3. Experimentation details</title>
      <p>In this section, we present the dataset used and the experimental work conducted in this
competition.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>The training dataset used was strictly the one provided by the organizers of the task. It
consisted of a list of messages ordered chronologically for a set of users. Each of the users or
subjects was identified as a pathological gambler or as a control user. The provided dataset
corresponds to the test datasets used in the previous editions of eRisk, in 2021 [3] and 2022 [11].</p>
        <p>To carry out the experimentation, we took the test set of the eRisk 2021 task as the training set,
and the test set of the eRisk 2022 task as the validation set. The complete statistics of the training
and test datasets can be seen in Table 1. The statistics were calculated on the original
datasets without any preprocessing.</p>
        <p>Each sample of the training and test sets consisted of the concatenation of all the writings of
the subject, including the title if it was present.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Configurations tested: preprocessing and n-gram range of the vectorization.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Configuration</th><th>Preprocessing</th><th>N-gram range</th></tr>
            </thead>
            <tbody>
              <tr><td>C1</td><td>No</td><td>(1,1)</td></tr>
              <tr><td>C2</td><td>No</td><td>(1,2)</td></tr>
              <tr><td>C3</td><td>Yes</td><td>(1,1)</td></tr>
              <tr><td>C4</td><td>Yes</td><td>(1,2)</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Word vectorization</title>
        <p>The input data were vectorized using the scikit-learn class
sklearn.feature_extraction.text.TfidfVectorizer. The features corresponded exclusively to
the tokens of the submission texts. We limited the maximum number of features to
5,000; therefore, the shape of the training matrix was 2,184 x 5,000. We tested four different
configurations, all of them with the max_features parameter set to 5,000:
• C.1. Default options of TfidfVectorizer, using word unigrams.
• C.2. Default options of TfidfVectorizer, using word unigrams and bigrams.
• C.3. Default options of TfidfVectorizer, with previous preprocessing of the texts, using word
unigrams.
• C.4. Default options of TfidfVectorizer, with previous preprocessing of the texts, using word
unigrams and bigrams.</p>
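        <p>As an illustration, the configurations differ only in the ngram_range passed to TfidfVectorizer and in whether the texts are preprocessed first. A minimal sketch with invented documents:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# C1/C3 use word unigrams; C2/C4 add bigrams. All configurations cap the
# vocabulary at 5,000 features, as described above.
unigram_vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 1))
bigram_vec = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))

docs = ["lost money at the casino again", "took the dog for a walk"]
X = bigram_vec.fit_transform(docs)  # rows: users, columns: tf-idf features
print(X.shape)
```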
        <p>The preprocessing of the texts consisted of removing all punctuation marks, numerical
expressions, stopwords, and URLs; we also lowercased all the text. Finally, we used lemmas instead of
words.</p>
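        <p>A minimal sketch of this preprocessing follows. The stopword list below is a tiny illustrative one, and the lemmatization step is only marked with a comment; neither the actual stopword list nor the lemmatizer used by our system is detailed in these notes.</p>

```python
import re

# Tiny illustrative stopword list (assumption: not the list used by the system).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "at", "i"}

def preprocess(text):
    text = text.lower()                                 # lowercase all text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\d+", " ", text)                    # remove numerical expressions
    text = re.sub(r"[^\w\s]", " ", text)                # remove punctuation marks
    # Lemmatization would be applied here; omitted in this sketch.
    return " ".join(t for t in text.split() if t not in STOPWORDS)

print(preprocess("I lost 300 dollars at www.example.com, AGAIN!"))
```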
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model fitting</title>
        <p>To adjust the model parameters, a stratified cross-validation was performed on the training
corpus using a 10-fold strategy. This adjustment was made for the four configurations mentioned
above. Table 3 shows the results obtained by the four configurations. Since the dataset was
unbalanced, 164 positive samples vs. 2,180 negative samples, we used balanced accuracy as the
measure to compare and select the models.</p>
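        <p>Balanced accuracy averages the recall of the two classes, so a degenerate model that always predicts the majority class is not rewarded. A small illustration with toy label counts (not the task's):</p>

```python
from sklearn.metrics import balanced_accuracy_score

# 2 positives vs. 18 negatives: always predicting "negative" yields 90%
# plain accuracy, but recall is 0.0 on positives and 1.0 on negatives,
# so balanced accuracy is (0.0 + 1.0) / 2 = 0.5.
y_true = [1, 1] + [0] * 18
y_pred = [0] * 20
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```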
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results</title>
        <p>Each of the four models obtained was evaluated on the validation set. To make the prediction
for a subject, the concatenation of all its submissions is provided as input to the model. The
results obtained are shown in Table 3. It can be observed that:
• Models that used preprocessing obtained better precision results (0.987 with the C3
configuration and 0.985 with C4) than those that did not use it (0.968 with the C1 configuration and
0.966 with C2).
• Models without preprocessing achieved better recall and F1 scores.
• Models that included bigrams in the vectorization did not improve on those that only
included unigrams.
• Models with preprocessing obtained better F1 results (0.918 with the C3 configuration
and 0.906 with C4).</p>
        <p>The criterion chosen to participate in the task was the configuration that maximized the value
of F1. Consequently, we chose the C3 configuration. It has the following characteristics:
the texts were preprocessed as indicated above, the vectorization process included only
unigrams and 5,000 features, and the model parameters were: C = 10, kernel = ’poly’, degree = 2.
Table 2 summarizes the configurations tested.</p>
        <p>We calculated some statistics about the subjects detected as true positives: the
average number of submissions per user needed by the system to identify a true positive was
12.2 submissions per user, with a standard deviation of 28.7.</p>
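        <p>The selected C3 configuration can be sketched as a scikit-learn pipeline; this is an illustrative sketch, and the two training texts below are invented placeholders for the preprocessed writings, not task data.</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# C3: unigram tf-idf capped at 5,000 features, plus the SVM parameters
# selected by the grid search (C=10, polynomial kernel of degree 2).
model = make_pipeline(
    TfidfVectorizer(max_features=5000, ngram_range=(1, 1)),
    SVC(C=10, kernel="poly", degree=2),
)

# Toy stand-in for the training data (illustration only).
texts = ["bet lost chasing losses casino", "sunny walk in the park"] * 10
labels = [1, 0] * 10
model.fit(texts, labels)
print(model.predict(["casino losses bet"]))
```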
        <p>Most of the true positive subjects were identified with a low number of submissions.
Specifically, 36% of them were identified at the first submission, 75% of them were identified within
the first 10 submissions, and only 9% of the subjects needed more than 20 submissions.</p>
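        <p>The round-by-round behaviour measured above can be sketched as a simple alert loop: at each round, the user's writings so far are concatenated and fed to the model, and an alert is emitted at the first positive prediction. The keyword-based stub below is a hypothetical stand-in for the fitted pipeline, used only so the sketch is self-contained.</p>

```python
def early_alert(model, submissions):
    """Return the round (1-based) at which the model first flags the user
    as positive, or None if it never does."""
    history = []
    for i, text in enumerate(submissions, start=1):
        history.append(text)
        # Concatenate all writings seen so far, as described above.
        if model.predict([" ".join(history)])[0] == 1:
            return i
    return None

class KeywordStub:
    """Hypothetical stand-in for the fitted SVM pipeline."""
    def predict(self, texts):
        return [1 if "casino" in texts[0] else 0]

print(early_alert(KeywordStub(), ["nice day", "went to the casino", "lost it all"]))
```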
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Work</title>
      <p>We have presented our approach to eRisk Task 2: Early Detection of Signs of Pathological
Gambling. Due to the amount of text in each sample, we decided to use an SVM-based
approach. We performed a grid search to select the best configuration of the SVM model. The results
obtained support the adequacy of our method for task 2 of the eRisk competition.
These results, provided by the organizers of the task in the preliminary report, were: Precision:
1.000, Recall: 0.883, F1: 0.938, ERDE5: 0.026, ERDE50: 0.010, latencyTP: 4.0, speed: 0.988, and
latency-weighted F1: 0.927.</p>
      <p>As future work, we intend to explore the use of pretrained Large Language Models to address
this task. This includes the definition of strategies to handle inputs longer than the maximum
input length of these models.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially supported by MCIN/AEI/10.13039/501100011033, by the European Union
“NextGenerationEU/MRR”, and by “ERDF A way of making Europe” under grants
PDC2021120846-C44 and PID2021-126061OB-C41. It is also partially supported by the Generalitat
Valenciana under projects CIPROM/2021/023 and PROMETEO/2020/024, and by the Universitat
Politècnica de València under the grant PAID-01-22 for pre-doctoral contracts for the training
of doctors.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[3] J. Parapar, P. Martín, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early Risk Prediction
on the Internet, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly,
M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
Multimodality, and Interaction. Proceedings of the Twelfth International Conference of the
CLEF Association (CLEF 2021), Springer International Publishing, 2021.
[4] J. M. Loyola, S. Burdisso, H. Thompson, L. C. Cagnina, M. Errecalde, UNSL at eRisk 2021:
A comparison of three early alert policies for early risk detection, in: G. Faggioli, N. Ferro,
A. Joly, M. Maistro, F. Piroi (Eds.), Proceedings of the Working Notes of CLEF 2021 -
Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to
24th, 2021, volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 992-1021.
[5] J. M. Loyola, H. Thompson, S. Burdisso, M. Errecalde, UNSL at eRisk 2022: Decision policies
with history for early classification, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast
(Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR
Workshop Proceedings, CEUR-WS.org, 2022, pp. 947-960.
[6] H. Srivastava, L. N. S, S. S, T. Basu, NLP-IISERB@eRisk2022: Exploring the potential of bag
of words, document embeddings and transformer based framework for early prediction of
eating disorder, depression and pathological gambling over social media, in: G. Faggioli,
N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 -
Conference and Labs of the Evaluation Forum, Bologna, Italy, September 5th to 8th,
2022, volume 3180 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 972-986.
[7] T. Dumitrascu, CLEF eRisk 2022: Detecting early signs of pathological gambling using ML
and DL models with dataset chunking, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast
(Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR
Workshop Proceedings, CEUR-WS.org, 2022, pp. 883-893.
[8] S. Stalder, E. Zankov, ZHAW at eRisk 2022: Predicting signs of pathological gambling -
GloVe for snowy days, in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Proceedings
of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum,
Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop Proceedings,
CEUR-WS.org, 2022, pp. 987-994.
[9] V. N. Vapnik, The nature of statistical learning theory, Springer-Verlag New York, Inc.,
1995.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser,
I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/
paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[11] J. Parapar, P. Martín, D. E. Losada, F. Crestani, Overview of eRisk 2022: Early Risk Prediction
on the Internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction.
Proceedings of the Thirteenth International Conference of the CLEF Association (CLEF
2022), Springer International Publishing, 2022.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] F. Pla, L. Hurtado, J. González, V. Ahuir, E. Segarra, E. Sanchis, M. J. C. Bleda, F. García, GUAITA: monitorización y análisis de redes sociales para la ayuda a la toma de decisiones (GUAITA: monitoring and analysis of social media to help decision making), in: M. A. Alonso, M. A. Ramos, C. Gómez-Rodríguez, D. Vilares, J. Vilares (Eds.), Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2022) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2022), A Coruña, Spain, September 21-23, 2022, volume 3224 of CEUR Workshop Proceedings, CEUR-WS.org, 2022, pp. 79-82. URL: https://ceur-ws.org/Vol-3224/paper19.pdf.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of eRisk 2023:
          <article-title>Early Risk Prediction on the Internet, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction. 14th International Conference of the CLEF Association, CLEF 2023, Springer International Publishing, 2023.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>