<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hidden Social Bots via Genetic Similarity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo Allegrini</string-name>
          <email>allegrini@di.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Di Paolo</string-name>
          <email>dipaolo@di.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marinella Petrocchi</string-name>
          <email>marinella.petrocchi@iit.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angelo Spognardi</string-name>
          <email>spognardi@di.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Istituto di Informatica e Telematica, CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Scuola IMT Alti Studi Lucca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media platforms face an ongoing challenge in combating the proliferation of social bots, automated accounts that are also known to distort public opinion and support the spread of disinformation. Over the years, social bots have evolved greatly, often becoming indistinguishable from real users, and more recently, families of bots have been identified that are powered by Large Language Models to produce content for posting. We suggest an idea to classify social users as bots or not using genetic similarity algorithms. These algorithms provide an adaptive method for analyzing user behavior, allowing for the continuous evolution of detection criteria in response to the ever-changing tactics of social bots. Our proposal involves an initial clustering of social users into distinct macro species based on the similarities of their timelines. Macro species are then classified as either bot or genuine based on genetic characteristics. The preliminary idea we present, once fully developed, will allow existing detection applications based on timeline equality alone to be extended to detect bots. By incorporating new metrics, our approach will systematically classify non-trivial accounts into appropriate categories, effectively peeling back layers to reveal non-obvious species.</p>
      </abstract>
      <kwd-group>
        <kwd>Social bot detection</kwd>
        <kwd>Bioinformatics</kwd>
        <kwd>Social Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The digital age has brought with it an unprecedented proliferation of accounts on social platforms,
resulting in a diverse and complex ecosystem. Of particular note are automated accounts, commonly
known as social bots [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. These digital artifacts have received increasing attention, not only because of
their ubiquity, but also because of the role they often play as vehicles for misinformation and propaganda
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Recently, families of bots have been heuristically identified that use large language models to
produce content for publication. The proliferation of these advanced bots raises concerns about the
inability of researchers to detect them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        One of the most important strands of research began with the realization that bots, programmed to
pursue specific goals, often operate in a coordinated manner and exhibit similar behaviors. In particular,
a notably relevant modeling and detection technique is the one based on digital DNA
[
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Digital DNA is a string of characters, each associated with a specific account action,
representing the timeline of the account. This modeling technique has been used in several studies, see,
e.g., [
        <xref ref-type="bibr" rid="ref10 ref7 ref8 ref9">7, 8, 9, 10</xref>
        ].
      </p>
      <p>Based on the intuition that accounts of the same type behave similarly, if not exactly the same, we
propose a method for classifying bot accounts that may be mistaken for real accounts.</p>
      <p>CEUR</p>
      <p>ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Method</title>
      <p>The classification approach we propose has been designed for Twitter/X users, but minor adjustments
can easily adapt it for use on other social media platforms. Figure 1 shows the scheme of the procedure.</p>
      <p>[Figure 1: scheme of the procedure. Panels: DNA Encoder; Preprocessed Dataset; Species clustering via LCS; Classification via Genetic Similarity. Labels shown: gSpamBot, gGenuine, species, Pareto.]</p>
      <sec id="sec-2-2">
        <sec id="sec-2-2-1">
          <title>Species clustering</title>
          <p>Digital DNA. The first phase involves pre-processing users by encoding their online behavior using
Digital DNA. Digital DNA aims to compactly represent the behavior of a social account, using a sequence
of characters from an alphabet, such as the following one made of three characters, each denoting one type of action:
plain tweet, retweet, and reply.</p>
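          <p>As a minimal sketch of the encoding step, the snippet below maps a chronologically ordered list of actions to a digital DNA string. The letters A, C, and T, the action labels, and the helper name encode_timeline are assumptions made for illustration only; the text does not fix which character denotes which action.</p>
          <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical symbol choice: which letter denotes which action is an assumption.
ACTION_TO_BASE = {"tweet": "A", "retweet": "C", "reply": "T"}


def encode_timeline(actions):
    """Return the digital DNA string of a chronologically ordered list of action types."""
    return "".join(ACTION_TO_BASE[action] for action in actions)


# Toy timeline: one plain tweet, two retweets, one reply, one plain tweet.
timeline = ["tweet", "retweet", "retweet", "reply", "tweet"]
print(encode_timeline(timeline))  # prints ACCTA
```
]]></preformat>
          <p>Under this assumed alphabet, two accounts that perform the same actions in the same order obtain identical strings, which is what makes substring comparison meaningful in the next phase.</p>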
        </sec>
        <sec id="sec-2-2-4">
          <p>
            Clustering users to species. After the pre-processing stage, users are grouped into macro species,
based on the concept of the Longest Common Substring (LCS) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. The LCS of two or more strings
is the longest string that is a substring of all of them. Figure 2 shows an example of an LCS curve,
where the abscissa gives the groups of k users, and the ordinate gives the length of the LCS for each of
the user groups. Considering a dataset of users paired with their digital DNA sequences, the LCS
computation is performed in linear time [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] between k users, where k ranges from 2 to the number of users in the dataset. LCS is an indicator of
behavioral similarity within the user group: when the sub-DNAs in the LCS curve are of approximately
constant length, we can deduce that the users with these sequences have similar behavior. Conversely,
if we observe a significant drop in the LCS curve, we know that the behavior of the newly added users
differs considerably from that of the users in the previous group. As an example, the red circle in the
figure marks the point where the curve exhibits a significant drop. To identify and cluster users into
species, the first significant drop in the LCS curve is detected, which marks a behavioral shift. Users
associated with that drop are grouped into a new species and subsequently removed. The LCS curve is
then recomputed for the remaining users that have not been assigned to a cluster yet. This process is
repeated, progressively segmenting and grouping all users in the dataset into distinct clusters called
species.
          </p>
          <p>
            Spambot and genuine accounts: first arrangement. From the macro clusters obtained, two key
groups are constructed: gSpamBot, by selecting among the species the one demonstrating evident
bot-like behavior, and gGenuine, with a high predominance of genuine users. Later, these two groups
will be populated by the remaining users, by adopting specific algorithms developed to measure the
genetic similarity.
          </p>
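          <p>The clustering loop described above can be illustrated with a small sketch. It is not the linear-time algorithm of [11]: substrings are enumerated naively, which is only practical for toy inputs. Reading the LCS between k users as the longest substring shared by at least k users, the relative threshold ratio used to call a drop significant, the choice to collect any leftover users into one final species, and all function names are assumptions introduced for illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def substring_support(dnas):
    """Map every substring to the set of user indices whose DNA contains it (naive)."""
    support = defaultdict(set)
    for idx, dna in enumerate(dnas):
        for i in range(len(dna)):
            for j in range(i + 1, len(dna) + 1):
                support[dna[i:j]].add(idx)
    return support


def lcs_curve(dnas):
    """curve[k] = (substring, users): longest substring shared by at least k users."""
    support = substring_support(dnas)
    curve = {}
    for k in range(2, len(dnas) + 1):
        best = ("", set())
        for sub, users in support.items():
            if len(users) >= k and len(sub) > len(best[0]):
                best = (sub, users)
        curve[k] = best
    return curve


def first_significant_drop(curve, ratio=0.5):
    """Smallest k whose successor has an LCS shorter than ratio times the LCS at k."""
    ks = sorted(curve)
    for k, nxt in zip(ks, ks[1:]):
        if len(curve[nxt][0]) < ratio * len(curve[k][0]):
            return k
    return None


def cluster_species(dnas, ratio=0.5):
    """Peel off one species per detected drop, recomputing the curve on the rest."""
    remaining = list(range(len(dnas)))
    species = []
    while len(remaining) >= 2:
        curve = lcs_curve([dnas[i] for i in remaining])
        k = first_significant_drop(curve, ratio)
        if k is None:
            break
        members = sorted(remaining[i] for i in curve[k][1])
        species.append(members)
        remaining = [i for i in remaining if i not in members]
    if remaining:
        species.append(remaining)  # assumption: leftovers form one last species
    return species


dnas = ["ACACACAC", "ACACACAT", "CACACACA", "TTCT", "ATTA"]
print({k: len(sub) for k, (sub, _) in lcs_curve(dnas).items()})
print(cluster_species(dnas))
```
]]></preformat>
          <p>On this toy input the first three sequences share a long substring while the last two do not, so the curve stays flat and then falls sharply, and the three similar users are separated from the others. A real implementation would need a suffix-based algorithm to remain tractable on full timelines.</p>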
          <p>To establish the initial gSpamBot and gGenuine groups, the following idea is applied: the LCS of
a species represents the users within it; therefore, a group with a long LCS indicates similar social
behavior, possibly an indicator of social bots. Conversely, a group with a short LCS implies
a diversity in the social behavior of its members. The construction of the initial gSpamBot group is
inspired by the Pareto principle, which seeks to determine the subset of individuals that have the
most significant impact on the overall community. Based on the original dataset, users have now been
categorized into three groups: gSpamBot, which comprises a considerable number of users exhibiting
very similar behavior, strongly indicating that they are social bots; gGenuine, formed by merging
species that demonstrate human-like actions; and lastly, species that do not fall into either gSpamBot or
gGenuine and are therefore unlabeled at this stage.</p>
          <p>Classification of species using genetic similarity. To classify the unlabeled species (those colored
gray in Figure 1), whose users are not immediately classifiable as bots or not, we propose an algorithm
that uses custom genetic similarity metrics. The idea is to use a sequence alignment algorithm, a
well-established technique in bioinformatics, to compare the LCS of each unlabeled species with those of the
two primary groups. After this alignment, our algorithm introduces a structured classification process.
The process involves two key steps: first, calculating a similarity score based on the alignment of LCS
sequences between species; and second, evaluating a new metric that considers the relative similarity
of DNA sequences within a species and the size of the population contributing to that similarity. By
integrating these procedures, we believe that our approach can effectively label the previously unlabeled
species as either gSpamBot or gGenuine, thereby completing the account classification process.</p>
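          <p>To make the first of the two steps more concrete, the following sketch scores the LCS of an unlabeled species against the LCS of each primary group with a global alignment. The text does not name a specific alignment algorithm; Needleman-Wunsch is used here as one standard choice. The scoring parameters (match, mismatch, gap), the tie-breaking rule, and the function names are assumptions. The second step, the population-aware metric, is not sketched because its definition is not given.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two strings (hypothetical parameters)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def label_species(species_lcs, spambot_lcs, genuine_lcs):
    """Assign the label of the primary group whose LCS aligns best.

    Ties go to gGenuine, which is an arbitrary choice made for this sketch.
    """
    bot_score = alignment_score(species_lcs, spambot_lcs)
    genuine_score = alignment_score(species_lcs, genuine_lcs)
    return "gSpamBot" if bot_score > genuine_score else "gGenuine"


# Toy LCS strings for the two primary groups and one unlabeled species.
print(label_species("ACACAC", spambot_lcs="ACACACAC", genuine_lcs="ATTCT"))
```
]]></preformat>
          <p>A score based on alignment alone ignores how many users contribute to each LCS, which is the gap the second, population-aware metric is meant to close.</p>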
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusions</title>
      <p>In this short paper, we presented the idea of a new approach to social bot detection that we hope will
not only achieve effective classification, but also maintain a transparent decision-making process.</p>
      <p>We plan to implement and test our proposed classifier using well-established bot repositories (see
the ones published on the site of the OSOME research unit at Indiana University) as well as recently
discovered datasets where social bots use Large Language Models (LLMs) to write their posts.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work is partially supported by project SERICS (PE00000014) under the NRRP MUR program funded
by the EU - NGEU; by project re-DESIRE (DissEmination of ScIentific REsults 2.0), funded by IIT-CNR;
by project ‘Prebunking: predicting and mitigating coordinated inauthentic behaviors in social media’,
funded by Sapienza University of Rome.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Varol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Menczer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flammini</surname>
          </string-name>
          ,
          <article-title>The rise of social bots</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>59</volume>
          (
          <year>2016</year>
          )
          <fpage>96</fpage>
          -
          <lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresci</surname>
          </string-name>
          ,
          <article-title>A decade of social bot detection</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>72</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L.</given-names>
            <surname>Ciampaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Varol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flammini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Menczer</surname>
          </string-name>
          ,
          <article-title>The spread of low-credibility content by social bots</article-title>
          ,
          <source>Nature Communications</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.-C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Menczer</surname>
          </string-name>
          ,
          <article-title>Anatomy of an AI-powered malicious social botnet</article-title>
          ,
          <source>Journal of Quantitative Description: Digital Media</source>
          <volume>4</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Di Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spognardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          ,
          <article-title>DNA-inspired online behavioral modeling and its application to spambot detection</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>31</volume>
          (
          <year>2016</year>
          )
          <fpage>58</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Di Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spognardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tesconi</surname>
          </string-name>
          ,
          <article-title>Social fingerprinting: Detection of spambot groups through DNA-inspired behavioral modeling</article-title>
          ,
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          <volume>15</volume>
          (
          <year>2018</year>
          )
          <fpage>561</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Paolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petrocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spognardi</surname>
          </string-name>
          ,
          <article-title>From online behaviours to images: A novel approach to social bot detection</article-title>
          ,
          <source>in: Computational Science</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pasricha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <article-title>Detecting bot behaviour in social media using digital DNA compression</article-title>
          ,
          <source>in: Artificial Intelligence and Cognitive Science</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gilmary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          ,
          <article-title>Entropy-based automation detection on Twitter using DNA profiling</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>4</volume>
          (
          <year>2023</year>
          )
          <fpage>847</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <article-title>A hybrid framework for bot detection on Twitter: Fusing digital DNA with BERT</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>82</volume>
          (
          <year>2023</year>
          )
          <fpage>30831</fpage>
          -
          <lpage>30854</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ohlebusch</surname>
          </string-name>
          ,
          <article-title>Linear time algorithms for generalizations of the longest common substring problem</article-title>
          ,
          <source>Algorithmica</source>
          <volume>60</volume>
          (
          <year>2011</year>
          )
          <fpage>806</fpage>
          -
          <lpage>818</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>