<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NLP-MisInfo-2023 - Abstract - Countering Malicious Content Moderation Evasion in Online Social Networks: Simulation and Detection of Word Camouflage</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Camacho</string-name>
          <email>david.camacho@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Huertas-García</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alejandro Martín</string-name>
          <email>alejandro.martin@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Huertas-Tato</string-name>
          <email>javier.huertas.tato@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Systems Engineering, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>St. Ramiro de Maeztu, 28040 Madrid, Spain</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research introduces novel methodologies and tools to combat content evasion in multilingual Natural Language Processing on social networks. A unique Python package, “pyleetspeak”, is developed, offering a customizable system for simulating multilingual content evasion through word camouflage techniques. The study also presents a synthetic multilingual dataset of camouflaged words, facilitating the training of models for camouflage detection. In a comparative analysis of various models, the multilingual MPNET-ideal model, pre-trained on an extended mSTSb dataset, outperforms other models in detecting camouflaged content across languages. The research underscores the utility of the tool in improving content moderation, enhancing online security, and serving as a potential data augmentation tool for AI systems. This work constitutes a significant contribution towards combating information disorders on social networks and sets the stage for further research in this field.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Disorders</kwd>
        <kwd>Leetspeak</kwd>
        <kwd>Word camouflage</kwd>
        <kwd>Multilingualism</kwd>
        <kwd>Content Evasion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>This work presents the Python package “pyleetspeak”, a synthetic multilingual
dataset of camouflaged words2, and a multilingual Transformer-based model3 to identify various
word camouflage techniques and prevent content evasion over 20 languages4. The efficacy of
multilingual pre-training in semantic similarity for enhancing such models is also explored.</p>
      <p>
        A novel system for simulating multilingual content evasion through word camouflage
techniques is developed based on literature references [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8">5, 6, 7, 8</xref>
        ] and strategies observed on social
media. This system comprises three modules: LeetSpeaker, PunctuationCamouflage, and
InversionCamouflage, all of which are embedded into the Python package “pyleetspeak”.
The LeetSpeaker module uses “leetspeak”, a character replacement system (e.g., “vaccination”
becomes “v@ccin@tion” or “v4ccin4tion”), while the PunctuationCamouflage module inserts
punctuation marks within words to confound content moderation algorithms (e.g., “COVID-19”
is transformed into “C.O.V.I.D.-1.9”). Lastly, the InversionCamouflage module scrambles words by
reversing the order of syllables (e.g., “Methodology” can be changed to “Me-do-tho-lo-gy”).
      </p>
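The three transformations can be sketched in a few lines of plain Python. This is an illustrative re-implementation only, not the pyleetspeak API: the substitution map, function names, and syllable handling are assumptions, whereas the real LeetSpeaker, PunctuationCamouflage, and InversionCamouflage modules are configurable and far richer.

```python
import random

# Illustrative versions of the three camouflage strategies described above.
# The substitution map below is an assumed subset of common leetspeak swaps.
LEET_MAP = {"a": ["4", "@"], "e": ["3"], "i": ["1", "!"], "o": ["0"]}

def leetspeak(word: str, seed: int = 0) -> str:
    """Replace characters with visually similar symbols (e.g. 'a' -> '4' or '@')."""
    rng = random.Random(seed)
    return "".join(rng.choice(LEET_MAP[c]) if c in LEET_MAP else c for c in word)

def punctuation_camouflage(word: str, mark: str = ".") -> str:
    """Insert a punctuation mark between every pair of characters."""
    return mark.join(word)

def invert_syllables(syllables: list, i: int, j: int) -> str:
    """Swap two syllables of a pre-syllabified word and rejoin with hyphens."""
    syllables = list(syllables)
    syllables[i], syllables[j] = syllables[j], syllables[i]
    return "-".join(syllables)

print(leetspeak("vaccination"))            # a camouflaged variant, e.g. 'v4cc1n4t10n'
print(punctuation_camouflage("COVID-19"))  # 'C.O.V.I.D.-.1.9'
print(invert_syllables(["Me", "tho", "do", "lo", "gy"], 1, 2))  # 'Me-do-tho-lo-gy'
```

Seeding the leetspeak generator makes the camouflage reproducible, which matters later when the same generator must produce a consistent annotated dataset.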
      <p>The “pyleetspeak”1 package also serves as a data generator, which uses KeyBERT
to extract semantically relevant words, apply camouflage methods, and generate data annotated
in spaCy format. The data is tagged with four entities representing different camouflage methods,
and a dictionary detailing the parameters applied to each instance ensures process interpretability.</p>
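One generated example might look as follows. The entity labels are those named later in the article (“LEETSPEAK”, “PUNCT_CAMO”); the sample text and character offsets are invented for illustration of the spaCy-style (text, annotations) shape.

```python
# Hypothetical generated training example in the spaCy annotation format:
# the camouflaged text plus character-offset entity spans. Labels follow the
# camouflage entities named in the article; the sample text is invented.
example = (
    "Get your v4cc1ne and stop C.O.V.I.D",
    {"entities": [(9, 16, "LEETSPEAK"), (26, 35, "PUNCT_CAMO")]},
)

text, annotations = example
for start, end, label in annotations["entities"]:
    print(f"{text[start:end]!r} -> {label}")
```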
      <p>
        An experimental protocol was designed to address the problem of word camouflage in
multilingual content. The protocol starts with the creation of a synthetic multilingual dataset
from non-camouflaged text data. This dataset, curated from various sources (OPUS
NewsCommentary [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], OPUS ParaCrawl [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], TED2020 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and WikiMatrix [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]), is used to train
models to recognize camouflaged entities in monolingual and multilingual contexts. After
camouflaging, the data is divided into training, validation, and testing sets, ensuring the camouflage
stems exclusively from our generator tool.
      </p>
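The split step is conceptually simple; a minimal sketch follows, with the 80/10/10 ratios and the fixed seed being assumptions (the article does not state them).

```python
import random

def train_val_test_split(examples, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle and partition the camouflaged examples; ratios are assumptions."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Because camouflage is applied before splitting, every partition contains only generator-produced camouflage, as the protocol requires.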
      <p>
        To handle the task of word camouflage detection, a variety of models are employed. These
include paraphrase-multilingual-mpnet-base-v2 (MPNET-base) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
mstsb-paraphrase-multilingual-mpnet-base-v2 (MPNET-ideal)3 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], bloomz-560m [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], xlm-roberta-base [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and
bert-base-multilingual-cased [16]. These models are fine-tuned through the spaCy interface, establishing a
comprehensive training architecture for the task at hand.
      </p>
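Preparing the annotated examples for spaCy fine-tuning typically means serializing them into spaCy's binary corpus format. The sketch below uses the standard spaCy calls (blank multilingual pipeline, `Doc.char_span`, `DocBin`); the example texts and offsets are invented, and the actual training configuration used in the article is not reproduced here.

```python
import spacy
from spacy.tokens import DocBin

# Invented examples in the (text, {"entities": [(start, end, label)]}) shape.
examples = [
    ("stop C.O.V.I.D now", {"entities": [(5, 14, "PUNCT_CAMO")]}),
    ("get v4cc1n4t3d", {"entities": [(4, 14, "LEETSPEAK")]}),
]

nlp = spacy.blank("xx")  # "xx" = spaCy's language-agnostic multilingual class
db = DocBin()
for text, ann in examples:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label, alignment_mode="contract")
             for start, end, label in ann["entities"]]
    doc.ents = [sp for sp in spans if sp is not None]
    db.add(doc)
# db.to_disk("train.spacy")  # serialized corpus consumed by `spacy train`
print(len(db))
```

`alignment_mode="contract"` drops spans that do not align with token boundaries instead of raising, a common safeguard when offsets come from an automatic generator.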
      <p>The developed model and the curated dataset are made publicly available for broader research
and application. The open accessibility of these resources promotes transparency, encourages
reproducibility, and potentially enables further advancements in the field of content evasion
detection.</p>
      <p>In a research effort to develop the best multilingual NER model for word camouflage
detection, the study conducted various experiments and presented impressive findings. The
most striking result was the performance of the MPNET-ideal model, a version of MPNET
pre-trained on the semantic textual similarity task with the multilingual extended
mSTSb dataset. The MPNET-ideal outperformed all other trained multilingual models across
most datasets, demonstrating its superiority in word camouflage detection. Specifically, the
model exhibited improved performance over the monolingual baseline models, with the most
substantial enhancement in Italian, where the F1 score went from 0.7061 to 0.8913.</p>
      <p>Footnotes: 2. https://github.com/Huertas97/XX_NER_WordCamouflage; 3. https://huggingface.co/Huertas97/xx_LeetSpeakNER_mstsb_mpnet; 4. Supported languages: ar, az, da, de, el, en, es, fi, fr, hu, id, it, kk, nb, ne, nl, pt, ro, ru, sl, sv, tg, tr.</p>
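The reported scores are entity-level F1 values. As a reminder of how the metric combines precision and recall, the snippet below computes it from true-positive/false-positive/false-negative counts; the counts are invented, chosen only so that the result lands near the Italian F1 quoted above.

```python
# Entity-level precision/recall/F1 as used in NER evaluation.
# The tp/fp/fn counts below are invented for illustration.
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=82, fp=10, fn=10)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.8913 0.8913 0.8913
```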
      <p>The models were also evaluated across different camouflage techniques, revealing that
detection of inversion camouflage was more challenging than punctuation or leetspeak
camouflage. The results suggested that the MPNET-ideal multilingual model could accurately
detect camouflaged entities across multiple languages and different types of text with high
precision and recall. It was further demonstrated that the model could effectively differentiate
between different camouflage techniques and handle a variety of languages. For instance, the
confusion matrices revealed the difficulty of differentiating “MIX” entities from “LEETSPEAK” or
“PUNCT_CAMO” entities due to their mixed elements, but the MPNET-ideal model still performed
admirably.</p>
      <p>Finally, the research validated the model’s performance using an external tool, AugLy [17].
Though designed for monolingual data augmentation, AugLy can apply transformations that
resemble camouflage techniques, making it an apt tool for external validation. The study
discovered that the model could accurately detect new camouflage strategies, such as
upside-down letters or emoticons in place of letters. However, it struggled to detect modifications in
less semantically meaningful words such as articles and pronouns. This shortcoming highlighted
the importance of focusing on semantically meaningful words when dealing with camouflage
detection. Overall, the MPNET-ideal model’s validation results underlined its impressive
capabilities in detecting various camouflage techniques, cementing its position as an effective tool
for multilingual word camouflage detection.</p>
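An upside-down-letters transformation of the kind AugLy applies can be approximated as follows. This is not AugLy's implementation, and the character map is a small assumed subset of the flipped-Unicode glyphs such tools use.

```python
# Rough approximation of an "upside-down letters" text augmentation; the
# character map is an assumed subset, not AugLy's own table.
UPSIDE_DOWN = {"a": "ɐ", "e": "ǝ", "m": "ɯ", "n": "u", "o": "o", "t": "ʇ", "v": "ʌ"}

def upside_down(word: str) -> str:
    """Flip each character and reverse the word, as if rotated 180 degrees."""
    return "".join(UPSIDE_DOWN.get(c, c) for c in reversed(word))

print(upside_down("vote"))  # 'ǝʇoʌ'
```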
      <p>To conclude, this research offers significant insights and practical solutions for addressing
content evasion in multilingual Natural Language Processing. The novel tool “pyleetspeak”
and the robust multilingual NER camouflage detection model effectively enhance content
moderation and improve online security. The tool’s utility extends beyond its immediate
application, indicating its potential in data augmentation for AI systems and future expansion
to other languages and evasion strategies.</p>
      <p>This summary encapsulates the key findings of the research paper [18], highlighting the
development and utilization of a synthetic multilingual dataset and the Python package “pyleetspeak”
for addressing the issue of content evasion in social networks. The original article presents more
in-depth insights and discusses the broader impacts of word camouflage on content moderation.
This research represents a significant stride towards combating information disorders on social
networks and provides a solid foundation for future research in this crucial area.</p>
      <sec id="sec-2">
        <title>Acknowledgments</title>
        <p>This research has been supported by the Spanish Ministry of Science and Education under
FightDIS (PID2020-117263GB-I00) and XAI-Disinfodemics (PLEC2021-007681) grants, by
Comunidad Autónoma de Madrid under S2018/TCS-4566 (CYNAMON), by BBVA Foundation
grants for scientific research teams SARS-CoV-2 and COVID-19 under the grant “CIVIC:
Intelligent characterisation of the veracity of the information related to COVID-19”, and by IBERIFIER
(Iberian Digital Media Research and Fact-Checking Hub), funded by the European Commission
under the call CEF-TC-2020-2, grant number 2020-EU-IA-0252. Finally, David Camacho has
been supported by the Comunidad Autónoma de Madrid under the “Convenio Plurianual with
the Universidad Politécnica de Madrid in the actuation line of Programa de Excelencia para el
Profesorado Universitario”.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Fagan</surname>
          </string-name>
          ,
          <article-title>Optimal social media content moderation and platform immunities</article-title>
          ,
          <source>European Journal of Law and Economics</source>
          <volume>50</volume>
          (
          <year>2020</year>
          )
          <fpage>437</fpage>
          -
          <lpage>449</lpage>
          . doi:10.1007/s10657-020-09653-7.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Sharevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alsaadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jachim</surname>
          </string-name>
          , E. Pieroni,
          <article-title>Misinformation warnings: Twitter's soft moderation efects on covid-19 vaccine belief echoes</article-title>
          ,
          <source>Computers &amp; Security</source>
          <volume>114</volume>
          (
          <year>2022</year>
          )
          102577. doi:10.1016/j.cose.2021.102577.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gerrard</surname>
          </string-name>
          ,
          <article-title>Beyond the hashtag: Circumventing content moderation on social media</article-title>
          ,
          <source>New Media &amp; Society</source>
          <volume>20</volume>
          (
          <year>2018</year>
          )
          <fpage>4492</fpage>
          -
          <lpage>4511</lpage>
          . doi:10.1177/1461444818776611.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Martín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huertas-Tato</surname>
          </string-name>
          , Á. Huertas-García,
          <string-name>
            <given-names>G.</given-names>
            <surname>Villar-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Camacho</surname>
          </string-name>
          ,
          <article-title>FacTeRCheck: Semi-automated fact-checking through semantic similarity and natural language inference</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>251</volume>
          (
          <year>2022</year>
          )
          109265. doi:10.1016/j.knosys.2022.109265.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kavanagh</surname>
          </string-name>
          ,
          <article-title>Bridge the generation gap by decoding leetspeak</article-title>
          ,
          <source>Inside the Internet</source>
          <volume>12</volume>
          (
          <year>2005</year>
          )
          <fpage>11</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Romero-Vicente</surname>
          </string-name>
          ,
          <article-title>Word camouflage to evade content moderation</article-title>
          ,
          <year>2021</year>
          . URL: https://www.disinfo.eu/publications/word-camouflage-to-evade-content-moderation/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Blashki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <article-title>Game geek's goss: linguistic creativity in young males within an online university forum</article-title>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <article-title>Gamespeak for n00bs - a linguistic and pragmatic analysis of gamers' language</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Graz,
          <year>2013</year>
          . URL: https://unipub.uni-graz.at/obvugrhs/content/ titleinfo/231890?lang=en.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <article-title>Parallel data, tools and interfaces in OPUS</article-title>
          ,
          <source>in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Istanbul, Turkey,
          <year>2012</year>
          , pp.
          <fpage>2214</fpage>
          -
          <lpage>2218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Making monolingual sentence embeddings multilingual using knowledge distillation</article-title>
          ,
          <year>2020</year>
          . arXiv:2004.09813.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          ,
          <article-title>Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia</article-title>
          ,
          <year>2019</year>
          . arXiv:1907.05791.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          , T.-Y. Liu,
          <article-title>MPNet: Masked and permuted pre-training for language understanding</article-title>
          ,
          <year>2020</year>
          . arXiv:2004.09297.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Huertas-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huertas-Tato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Camacho</surname>
          </string-name>
          ,
          <article-title>Countering Misinformation Through Semantic-Aware Multilingual Models</article-title>
          ,
          <source>in: Intelligent Data Engineering and Automated Learning - IDEAL 2021</source>
          , Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>323</lpage>
          . doi:10.1007/978-3-030-91608-4_31.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Bari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-X.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Almubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albanie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Alyafeai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Raff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Crosslingual generalization through multitask finetuning</article-title>
          ,
          <year>2022</year>
          . doi:10.48550/ARXIV.2211.01786.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov,
          <article-title>Unsupervised Cross-lingual Representation Learning at Scale</article-title>
          ,
          <year>2019</year>
          . doi:10.48550/ARXIV.1911.02116.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-W. Chang, K. Lee, K. Toutanova,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Papakipos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bitton</surname>
          </string-name>
          ,
          <article-title>AugLy: Data augmentations for robustness</article-title>
          ,
          <year>2022</year>
          . arXiv:2201.06494.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Huertas-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martín</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huertas-Tato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Camacho</surname>
          </string-name>
          ,
          <article-title>Countering malicious content moderation evasion in online social networks: Simulation and detection of word camouflage</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>145</volume>
          (
          <year>2023</year>
          ) 110552. doi:10.1016/j.asoc.2023.110552.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>