<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Representational learning for the detection of COVID related conspiracy spreaders in online platforms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adrián Girón Jiménez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ángel Panizo-LLedot</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Torregrosa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Camacho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. Computer Sciences, Universidad Rey Juan Carlos</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ETSI de Sistemas Informáticos, Universidad Politécnica de Madrid</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Representational learning is a set of techniques that automatically discovers, from raw data, the features required for a machine learning task. In recent years, the application of these techniques to graphs has shown promising results in node classification tasks. This work applies representational learning to identify users who share COVID-related conspiracy theories, using their interactions with peers as the main features for the classification algorithms. To do so, Node2vec and FastRP were used to learn numeric representations, i.e. embeddings, of the users. Then, Random Forest and XGBoost were used for the downstream classification task. In addition, a pseudo-labeling procedure was applied. The experimentation shows that classifying with interaction data achieves better performance than classifying with node attributes only. Moreover, FastRP achieves better performance than Node2vec. However, pseudo-labeling does not improve the performance of the models. Finally, we reject the inclusion of "cannot determine" labels in our model, as they prove to be detrimental.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This work introduces a social network analysis approach to detect nodes spreading conspiracy
theories related to COVID. The overview paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] explains the task in depth. The paper focuses
on the actors rather than the messages, using their interactions within a network as features for
classification. In particular, it focuses on the use of representational learning techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to
generate user embeddings in a semi-supervised manner, i.e. using unlabeled nodes related to
the original training sample, to be used in a downstream classification task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>
        Random Forest [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and XGBoost [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] were selected as classifier heads due to their good general
performance across different tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Additionally, given the unbalanced nature of the dataset,
we opted to use class weights, assigning greater importance to the spreader class.
Concerning the graph, its size made many of the intended techniques infeasible.
Therefore, the most superfluous connections, i.e. those edges with a weight below a given
threshold, were incrementally removed until a graph of feasible size was reached. This was
achieved with a threshold of five. However, as this split the graph into several connected components,
all the removed edges touching any of the nodes under study, i.e. those with a label or
those that need to be labeled, were added back. Finally, all nodes outside the biggest component were
discarded. The final graph had 1,574,681 nodes and 39,946,463 edges.
      </p>
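        <p>A minimal sketch of this pruning procedure, using networkx (function and variable names are illustrative, not from the original pipeline):</p>

```python
import networkx as nx

def prune_graph(g, labeled_nodes, threshold=5):
    """Remove edges with weight below `threshold`, re-add removed edges
    that touch a labeled (or to-be-labeled) node, then keep only the
    largest connected component."""
    weak = [(u, v) for u, v, w in g.edges(data="weight") if w < threshold]
    pruned = g.copy()
    pruned.remove_edges_from(weak)
    # Superfluous edges touching the nodes under study are added back
    pruned.add_weighted_edges_from(
        (u, v, g[u][v]["weight"]) for u, v in weak
        if u in labeled_nodes or v in labeled_nodes)
    # Discard everything outside the biggest connected component
    giant = max(nx.connected_components(pruned), key=len)
    return pruned.subgraph(giant).copy()
```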
      <sec id="sec-2-1">
        <title>2.1. Node attributes only</title>
        <p>As a baseline, a classification model using only the node attributes was created. The
information from each node (Twitter account) available to the classifier is the following: creation
date (number of days after Twitter’s creation), description length, number of favorites, number
of statuses, number of friends, and country (one-hot encoded, plus an "unknown_country" category). All
the data was normalized between 0 and 1.</p>
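        <p>A sketch of this feature construction with pandas; the column names and the two toy records are illustrative:</p>

```python
import pandas as pd

# Two illustrative Twitter accounts; fields mirror the attributes listed above
accounts = pd.DataFrame({
    "creation_date": [500, 3200],   # days after Twitter's creation
    "description_length": [40, 0],
    "favourites_count": [10, 2500],
    "statuses_count": [120, 9000],
    "friends_count": [30, 800],
    "country": ["ES", None],
})

# One-hot encode country, mapping missing values to "unknown_country"
country = pd.get_dummies(accounts["country"].fillna("unknown_country"),
                         prefix="country")
# Min-max normalize every numeric column into [0, 1]
numeric = accounts.drop(columns="country")
numeric = (numeric - numeric.min()) / (numeric.max() - numeric.min())
features = pd.concat([numeric, country], axis=1)
```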
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Representational learning</title>
        <p>
          Representational learning techniques generate vectors (also known as embeddings) so that
nodes that are similar in the graph are closer together in the embedding space [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Once the
embeddings for each node were calculated, they were used in a downstream classification task.
For this work, two representation learning techniques were used: node2vec [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and FastRP [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
The former is a popular method that has shown good results in node classification tasks [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
The latter is a random projection algorithm that is capable of generating embeddings that take
into account node attributes, which node2vec cannot do.
        </p>
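        <p>To illustrate the FastRP idea (a self-contained sketch, not the Neo4j GDS implementation actually used in this work): powers of the row-normalized adjacency matrix are projected through a very sparse random matrix and combined with per-power weights. Dimensions, iteration weights, and names are illustrative:</p>

```python
import numpy as np

def fastrp_embeddings(adj, dim=64, weights=(0.0, 1.0, 1.0), seed=0):
    """Sketch of FastRP: very sparse random projection of the powers of
    the row-normalized adjacency matrix `adj` (dense here for brevity)."""
    rng = np.random.default_rng(seed)
    n = adj.shape[0]
    s = 3.0  # sparsity parameter of the Achlioptas-style projection
    proj = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)], size=(n, dim),
                      p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    a = adj / deg  # random-walk transition matrix
    emb = np.zeros((n, dim))
    cur = proj
    for w in weights:  # weights[k] scales the (k+1)-th power of `a`
        cur = a @ cur
        emb += w * cur
    return emb
```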
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Pseudo-labeling</title>
        <p>
          Pseudo-labeling is a semi-supervised technique that selects unlabeled samples that a model has
classified with high confidence and adds them to the training set. Rizve et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] argue that
pseudo-labeling performance is usually low due to erroneous high-confidence predictions from
poorly calibrated models; these predictions generate many incorrect pseudo-labels, resulting
in noisy training. To correct this problem they propose an uncertainty-aware pseudo-label
          selection framework. The authors originally proposed their framework for neural
networks; in this work, we adapted it to work with tree ensembles.
In particular, we changed the uncertainty estimation method from MC-Dropout [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to the method
proposed by Polimis et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
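          <p>One round of the adapted pseudo-labeling can be sketched as follows. Here the uncertainty estimate is approximated by the variance of the per-tree votes (a simplified stand-in for the Polimis et al. estimator); both thresholds and all names are illustrative, and classes are assumed to be encoded as 0/1:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pseudo_label(X_train, y_train, X_unlab, prob_thresh=0.7, var_thresh=0.15):
    """Add high-confidence, low-uncertainty unlabeled samples to the
    training set (single pseudo-labeling round, sketch)."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_unlab)
    # Variance of per-tree positive-class probabilities as uncertainty
    per_tree = np.stack([t.predict_proba(X_unlab)[:, 1]
                         for t in clf.estimators_])
    uncertainty = per_tree.var(axis=0)
    confident = (proba.max(axis=1) >= prob_thresh) & (uncertainty <= var_thresh)
    pseudo_y = proba.argmax(axis=1)[confident]  # assumes classes_ == [0, 1]
    X_aug = np.vstack([X_train, X_unlab[confident]])
    y_aug = np.concatenate([y_train, pseudo_y])
    return X_aug, y_aug
```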
      </sec>
      <sec id="sec-2-4">
        <title>2.4. "Cannot Determine" labels</title>
        <p>
          The ability of the model to identify when a sample cannot be determined was assessed using
two approaches. The first uses the output probabilities generated by the model: when the
probability is lower than a threshold, the sample is labeled as "Cannot Determine". The
second uses the confidence of the model’s predictions instead of the output probabilities. Finally,
to calculate the confidence of a model’s prediction the method proposed by Polimis et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
was used.
        </p>
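          <p>The first approach can be sketched as follows; the threshold value and the -1 encoding of "Cannot Determine" are illustrative:</p>

```python
import numpy as np

def with_cannot_determine(proba, threshold=0.6, cd_label=-1):
    """Return class predictions, replacing low-probability ones with a
    "Cannot Determine" label. `proba` has shape (n_samples, n_classes)."""
    top = proba.max(axis=1)        # highest output probability per sample
    labels = proba.argmax(axis=1)  # the would-be predicted class
    return np.where(top < threshold, cd_label, labels)
```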
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Validation and hyperparameter tuning</title>
        <p>
          To obtain robust metrics, we followed the stratified K-fold cross-validation method with 10 folds.
The Matthews correlation coefficient (MCC) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] was used as the evaluation metric. To evaluate
each model, the mean and standard deviation of the scores obtained across the folds were computed.
In addition, Optuna [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] framework was used for hyperparameter tuning. Tables 1 and 2 show
the values selected for the hyperparameters. For the remaining classifier hyperparameters, the defaults in
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html and
https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier were used;
for the remaining embedding hyperparameters, the defaults in
https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/fastrp/ and
https://neo4j.com/docs/graph-data-science/current/machine-learning/node-embeddings/node2vec/ were used.
        </p>
        <p>[Tables 1 and 2 residue: selected hyperparameter values per approach (Node attributes, Node2Vec, FastRP, FastRP optimized). Recoverable values: n_estimators 132, min_samples_leaf 3, min_samples_split 2, max_depth 14, class_weight (class 1 / class 2) 1.0/2.001.]</p>
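        <p>The validation loop described above (stratified 10-fold cross-validation scored with MCC, with class weights favoring the spreader class) can be sketched as follows; the toy data and the weight values are illustrative:</p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy unbalanced data standing in for the user feature matrix
X, y = make_classification(n_samples=300, weights=[0.8, 0.2], random_state=0)

clf = RandomForestClassifier(class_weight={0: 1.0, 1: 2.0}, random_state=0)
scores = cross_val_score(
    clf, X, y,
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
    scoring=make_scorer(matthews_corrcoef))
print(f"MCC: {scores.mean():.3f} ({scores.std():.3f})")
```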
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Ensemble results</title>
        <table-wrap id="tab-ensemble">
          <caption>
            <p>MCC mean (standard deviation) across the 10 folds for each approach and classifier head.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Approach</th><th>Random Forest</th><th>XGBoost</th></tr>
            </thead>
            <tbody>
              <tr><td>Node attributes</td><td>0.130 (0.054)</td><td>0.156 (0.055)</td></tr>
              <tr><td>Node2Vec</td><td>0.129 (0.061)</td><td>0.115 (0.088)</td></tr>
              <tr><td>FastRP</td><td>0.259 (0.063)</td><td>0.301 (0.030)</td></tr>
              <tr><td>FastRP optimized</td><td>0.434 (0.071)</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. “Cannot Determine” labels</title>
        <p>Figure 1 shows the variation of the MCC score when different thresholds are selected for the
FastRP optimized model. The graph on the right shows the results of the model’s confidence
in the prediction, while the graph on the left shows the results of the output probability. As
we can see, labeling samples as "Cannot Determine" did not improve the model performance.
Please note that the maximum value is always obtained at the maximum possible value of the
threshold. Hence, no sample is labeled as "Cannot Determine".</p>
        <p>The effectiveness of the pseudo-labeling was evaluated by comparing the MCC of the FastRP
optimized model trained with labeled data only to the one trained with pseudo-labeling. For
this procedure, 10,000 extra unlabeled nodes were randomly selected. A   of 0.7, and a
  of 0.15 were selected after manual experimentation. This process was repeated
for each fold of a stratified K-fold validation procedure with 31 iterations.</p>
        <p>At the end of the pseudo-labeling procedure carried out during each fold, at least 95% of the unlabeled
samples were used to train the model. However, a Kruskal-Wallis H-test p-value of 0.75 showed
that applying the pseudo-labeling procedure was unhelpful.</p>
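        <p>The significance check above can be sketched with SciPy's Kruskal-Wallis H-test; the per-fold MCC scores below are illustrative placeholders, not the paper's results:</p>

```python
from scipy.stats import kruskal

# Illustrative per-fold MCC scores for the two training regimes
mcc_labeled_only = [0.41, 0.44, 0.39, 0.46, 0.43]
mcc_pseudo = [0.42, 0.43, 0.40, 0.45, 0.44]

stat, p_value = kruskal(mcc_labeled_only, mcc_pseudo)
# A large p-value means we cannot reject that both score distributions
# are the same, i.e. pseudo-labeling did not help
```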
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion and outlook</title>
      <p>This work presents a model for detecting COVID conspiracy theory spreaders online. Four
approaches were proposed: (i) a baseline model with node attributes only; (ii) representation
learning models using node2vec and FastRP to calculate node embeddings; (iii) pseudo-labeling
with unlabeled data; (iv) labeling nodes as "cannot determine" for low-confidence predictions.</p>
      <p>From our experimentation, it can be concluded that for our particular setup: (i) topology-based
models outperformed attribute-based ones; (ii) FastRP embeddings outperformed node2vec
due to its ability to consider node attributes and topology features; (iii) "Cannot determine"
labels were unhelpful, as the experiments show the same confidence distribution for correct
and incorrect predictions; (iv) finally, applying a pseudo-labeling procedure does not further
improve the performance of the model.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been supported by MICINN under FightDIS (PID2020-117263GB-I00); by
MCIN/AEI/10.13039/501100011033/ and the European Union NextGenerationEU/PRTR under the
XAIDisinfodemics grant (PLEC2021-007681); by the European Commission under IBERIFIER - Iberian Digital Media Research and
Fact-Checking Hub (2020-EU-IA-0252); by Comunidad Autónoma de Madrid under grant
S2018/TCS4566 (CYNAMON: Cybersecurity, Network Analysis and Monitoring for the Next Generation
Internet); by the project PCI2022-134990-2 (MARTINI) of the CHIST-ERA IV Cofund 2021
programme, funded by MCIN/AEI/10.13039/501100011033 and by the "European Union
NextGenerationEU/PRTR"; and by the "Convenio Plurianual with the Universidad Politécnica de Madrid in
the actuation line of Programa de Excelencia para el Profesorado Universitario".</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pogorelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maulana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langguth</surname>
          </string-name>
          ,
          <article-title>Combining tweets and connections graph for fakenews detection at MediaEval 2022</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Representation learning: A review and new perspectives</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>35</volume>
          (
          <year>2013</year>
          )
          <fpage>1798</fpage>
          -
          <lpage>1828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hamilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>Inductive representation learning on large graphs</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          ,
          <article-title>Random forests</article-title>
          ,
          <source>Mach. Learn.</source>
          <volume>45</volume>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          . doi:10.1023/A:1010933404324.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          ,
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM</source>
          ,
          <year>2016</year>
          . doi:10.1145/2939672.2939785.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Sagi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <article-title>Ensemble learning: A survey</article-title>
          ,
          <source>Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery</source>
          <volume>8</volume>
          (
          <year>2018</year>
          )
          <fpage>e1249</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          , node2vec:
          <article-title>Scalable feature learning for networks</article-title>
          ,
          <source>in: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F.</given-names>
            <surname>Sultan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          ,
          <article-title>Fast and accurate network embeddings via very sparse random projection</article-title>
          ,
          <source>in: Proceedings of the 28th ACM international conference on information and knowledge management</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>399</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <article-title>Graph embedding techniques, applications, and performance: A survey</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>151</volume>
          (
          <year>2018</year>
          )
          <fpage>78</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Rizve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Duarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning</article-title>
          ,
          <source>arXiv preprint arXiv:2101.06329</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gammerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vovk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Learning by transduction</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI'98)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Polimis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rokem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hazelton</surname>
          </string-name>
          ,
          <article-title>Confidence intervals for random forests in python</article-title>
          ,
          <source>Journal of Open Source Software</source>
          <volume>2</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Baldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brunak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chauvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A. F.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>Assessing the accuracy of prediction algorithms for classification: an overview</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>16</volume>
          (
          <year>2000</year>
          )
          <fpage>412</fpage>
          -
          <lpage>424</lpage>
          . doi:10.1093/bioinformatics/16.5.412.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>