<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Utilizing Natural Honeypots for Efficiently Labeling Astroturfer Profiles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan Schler</string-name>
          <email>schler@hit.ac.il</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisheva Bonchek-Dokow</string-name>
          <email>elishevabd@edu.aac.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomer Vainstein</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moshe Gotam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Teplitsky</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ashkelon Academic College</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Holon Institute of Technology</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Astroturfing is the practice of using a fake online social media (OSM) profile in order to influence public opinion, while giving the impression that the profile belongs to an authentic human user. To train a classifier that discriminates between authentic users and astroturfers, a labeled dataset must first be assembled. The labeling is generally done manually, by human judges, on a collection of profiles gathered from the social media network. However, any randomly collected set of profiles will statistically contain only a small proportion of astroturfers, which renders this process inefficient: much time and effort is invested in manually labeling large amounts of data, while producing only a small set of astroturfer profiles. We present here a method for quickly and efficiently collecting a dataset for manual labeling with a high percentage of astroturfers.</p>
      </abstract>
      <kwd-group>
        <kwd>Astroturfing</kwd>
        <kwd>Facebook</kwd>
        <kwd>Efficient Labeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Along with the growing use of social networks, awareness of their
dangers has also grown, albeit arguably at a slower pace. Adversarial use of online social
media (OSM) profiles appears in various scenarios, such as social ("cyber
bullying"), financial, and health (disseminating anti-vax pseudo-scientific claims).
We focus here on the political scenario. The practice of attempting to create a
false impression of wide public support for a specific candidate in the political field,
by using a fake OSM profile, is known as astroturfing. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] defines astroturfing
as the process of seeking electoral victory or legislative relief for grievances by
helping political actors find and mobilize a sympathetic public, a process designed
to create the image of public consensus where it does not necessarily exist. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
analyzes the phenomenon of astroturfing as it appears in its digital form on OSM
platforms. The authors define digital astroturfing as a form of manufactured, deceptive
and strategic top-down activity on the Internet initiated by political actors that
mimics bottom-up activity by autonomous individuals.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The phenomenon of fake
profiles involved in political discussions on OSM has attracted the attention of
several research groups. Some of these are funded by democratic governments,
due to the growing awareness that such profiles pose a threat to the cornerstones
of our democratic society, namely, trust and faith. A recent survey [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on
controlling astroturfing on the Internet refers to it as one of the most impactful
threats online today. A preliminary step in building an automatic
classifier for identifying astroturfers, as in any machine learning problem, is
creating a training set. The labeling itself is generally done manually, by
human judges trained for the task. However, experience has taught us that the
proportion of astroturfers found in any random collection of profiles is very low.
Three judges employed for the task of labeling a random set of profiles received
detailed guidelines with criteria for labeling a profile as an astroturfer. When
presented with a set of 400 randomly selected profiles, their labeling resulted in
only 9 profiles (2.25%) being labeled as astroturfers. Such a low figure creates an
unbalanced dataset. This, in and of itself, could be tolerated: the imbalance can be
handled satisfactorily by the algorithms, provided one starts out with a large enough
dataset. However, the amount of time and effort expended on this task is
unreasonable, rendering the process highly inefficient. What is needed is a
method that yields a higher proportion of astroturfers among the labeled profiles.
      </p>
      <p>
        One approach to solving this problem is to harvest suspicious accounts by
using what are known as "social honeypots" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These are fake profiles, set
up by the research team with characteristics that are expected to lure the type
of profiles being targeted. The profiles to which political astroturfers are
attracted are naturally those that belong to political candidates. Obviously,
setting up a honeypot in the form of a fake political candidate would not be a
viable tactic. We introduce here a simple and straightforward technique, arising
from insight gained over several months of collecting and analyzing profiles, posts
and comments in the Israeli political Facebook scene.
      </p>
      <p>Our dataset consists of close to half a million Facebook profiles collected over
a span of 15 months, from close to four million comments on the posts of some twenty
political candidates. During those 15 months, the political system in Israel went
through one upheaval after another, with three rounds of elections attempting
to reach a result that would enable the formation of a government. This rare
situation created for us a rich dataset, with many novel attributes useful for
identifying astroturfers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Utilizing the Innate Honeypot Nature of Political Profiles</title>
      <p>As mentioned, creating artificial honeypots in order to attract political
astroturfers would not work. However, we came to realize that the existing profiles of
the political candidates are in fact natural honeypots, in that they attract, by
their very nature, the astroturfer profiles. This in itself is not enough, since
authentic users are also attracted to these honeypots, and as mentioned, the ratio
of astroturfers to authentic users among those who interact with political profiles is
very low. This is where one of the many features we studied proved
useful for the task at hand. We noticed that astroturfers are quick to pounce on
fresh posts, rendering the pool of the first several commenters rich with
astroturfers. The method we suggest consists of collecting profiles which are among
the first several commenters on the posts, and presenting these profiles to the
human judges for labeling. We posited that the percentage of profiles labeled as
astroturfers would be significantly higher than for a random collection of profiles.
We present here data to support this thesis.</p>
      <p>We described above the motivation for this study: only 2.25% of the
training set was labeled as astroturfers, which is not much to show for the huge
amount of effort that went into the task. Our first attempt was based on the
realization that the proportion of comments astroturfers produce is much higher
than their own proportion in the population. Therefore, instead of sampling the
profiles, we sampled the comments, and chose those profiles which produced the
sampled comments. This approach indeed yielded a higher proportion of
labeled astroturfers: 46 of 400 profiles (11.5%). This is better, but not enough.</p>
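      <p>The difference between sampling profiles and sampling comments can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the <monospace>(profile_id, post_id)</monospace> comment schema and both function names are hypothetical, and de-duplicating the drawn authors is our own simplification.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from typing import Hashable, Sequence

# Hypothetical schema: each comment is a (profile_id, post_id) pair.
Comment = tuple[Hashable, Hashable]


def sample_profiles_uniformly(comments: Sequence[Comment], n: int,
                              rng: random.Random) -> list:
    """Baseline: every distinct profile is equally likely to be drawn."""
    profiles = sorted({profile for profile, _ in comments}, key=str)
    return rng.sample(profiles, min(n, len(profiles)))


def sample_profiles_via_comments(comments: Sequence[Comment], n: int,
                                 rng: random.Random) -> list:
    """Draw comments uniformly at random and keep their authors.

    A profile that wrote many comments is hit more often, so prolific
    commenters are over-represented relative to uniform profile sampling.
    """
    order = list(range(len(comments)))
    rng.shuffle(order)
    chosen, seen = [], set()
    for idx in order:
        profile = comments[idx][0]
        if profile not in seen:  # keep each author once (our assumption)
            seen.add(profile)
            chosen.append(profile)
            if len(chosen) == n:
                break
    return chosen
```
]]></preformat>
      <p>Because the chance of selecting a profile grows with the number of comments it wrote, any group that comments disproportionately often is over-represented in the resulting sample.</p>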
      <p>Our next attempt was to apply a preliminary manual sifting of profiles,
creating two sets: one of suspected profiles and one of innocent-looking ones. These
two sets were then presented to the judges for labeling, using the same guidelines.
The preliminary sifting proved useful: of 364 profiles, 76 (20.88%) were labeled
as astroturfers. This is much better; however, it required extra effort
for the preliminary sifting.</p>
      <p>The most significant growth in the proportion of labeled astroturfers was
achieved by what we present here as our method: instead of choosing from all
comments, we chose only from the top 10 comments, that is, the first 10 comments
on each post. Apparently, the tendency
to comment as quickly as possible is driven by astroturfers' high motivation to
disseminate and promote their agenda immediately, whenever the opportunity
presents itself in the form of a new post. This insight reveals the innate honeypot
nature of political profiles: there is no need to create fake profiles in order to lure
the astroturfers. These profiles already exist; they just need to be utilized in the
right way. We tested this hypothesis by targeting the profiles responsible for
the first 10 comments on 50 randomly chosen posts. The profiles were selected
by randomly sampling with replacement from the pool of 500 comments (top 10 from
50 posts). It is worth noting that these 500 comments belonged to some 300
commenters, a fact which can be attributed to the nature of such
quick-to-comment users, who also tend to comment more than once. Labeling
this pool of profiles uncovered 25% of them as astroturfers.</p>
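      <p>The selection procedure described above (first 10 comments on 50 random posts, then sampling with replacement from the pooled comments) might be sketched as follows. This is a minimal sketch under assumptions, not the study's implementation: the <monospace>Comment</monospace> record, its <monospace>timestamp</monospace> field, the function names and the <monospace>n_draws</monospace> parameter are all hypothetical, since the text does not state how many draws were made.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict
from typing import NamedTuple


class Comment(NamedTuple):
    post_id: str
    profile_id: str
    timestamp: float  # hypothetical field: when the comment was posted


def first_k_comments(comments, k=10):
    """Map each post to its k earliest comments."""
    by_post = defaultdict(list)
    for c in comments:
        by_post[c.post_id].append(c)
    return {post: sorted(cs, key=lambda c: c.timestamp)[:k]
            for post, cs in by_post.items()}


def top_k_candidates(comments, n_posts=50, k=10, n_draws=400, seed=0):
    """Pool the first k comments of n_posts random posts, draw comments
    with replacement, and return the distinct profiles that were hit."""
    rng = random.Random(seed)
    early = first_k_comments(comments, k)
    posts = rng.sample(sorted(early), min(n_posts, len(early)))
    pool = [c for post in posts for c in early[post]]
    if not pool:
        return []
    draws = rng.choices(pool, k=n_draws)  # sampling with replacement
    return sorted({c.profile_id for c in draws})
```
]]></preformat>
      <p>Sampling with replacement from comments rather than from commenters means that a profile appearing several times among the early comments is more likely to reach the judges, which matches the observation that quick commenters tend to comment more than once.</p>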
      <p>Figure 1 summarizes and compares all four methods: Random Baseline (where
profiles were selected randomly from among all profiles), Chosen Posts (where
profiles were selected by choosing randomly from comments rather than from
profiles), Past Selection (after preliminary sifting, as done in our past research)
and Top 10 (the proposed method of choosing from among the first 10
comments). Clearly, Top 10 has the highest ratio of labeled astroturfers. Not
far behind is Past Selection; recall, however, that the manual labour invested
in that result was far greater, so the advantage of Top 10 in profiles produced per
unit of effort is even more pronounced than the figure shows.
      </p>
      <p>These are only preliminary results. The number of first comments we targeted
in this study was 10; however, this is no magic number. Further studies should
compare results for different values of K, using the first K comments. It is
important to stress again that the human annotators received training with detailed
criteria and guidelines before they began labeling the data. These guidelines did
not change across the various profile sets presented for labeling. In
addition, it should be kept in mind that this method pertains only to the first
stage of preparing training data for the classification algorithms. The correct
choice of algorithms and parameters can then be found, applied and analyzed,
in order to create a high-performance classifier. A large set of labeled data is a
critical resource, and a method for obtaining such a set without wasting precious
time and manual effort is valuable. We intend to make this repository
of validated astroturfer profiles available for the use of the research community.</p>
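      <p>To make the comparison of the four methods concrete, the figures reported in the text can be turned into an approximate labeling cost per astroturfer found. The snippet below is a small worked calculation using only the numbers given above; the names <monospace>REPORTED_COUNTS</monospace>, <monospace>REPORTED_RATES</monospace> and <monospace>yield_table</monospace> are our own, and for Top 10 only the reported percentage is used because the raw counts are not stated.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Counts as reported in the text: (labeled astroturfers, profiles labeled).
REPORTED_COUNTS = {
    "Random Baseline": (9, 400),
    "Chosen Posts": (46, 400),
    "Past Selection": (76, 364),
}
# For Top 10 the text gives only the proportion, not the raw counts.
REPORTED_RATES = {"Top 10": 0.25}


def yield_table():
    """Return {method: (astroturfer rate, profiles labeled per astroturfer)}."""
    rates = {name: hits / total
             for name, (hits, total) in REPORTED_COUNTS.items()}
    rates.update(REPORTED_RATES)
    return {name: (rate, 1.0 / rate) for name, rate in rates.items()}


if __name__ == "__main__":
    for name, (rate, cost) in yield_table().items():
        print(f"{name:16s} {rate:7.2%}  "
              f"about {cost:.1f} profiles labeled per astroturfer")
```
]]></preformat>
      <p>On these figures, the random baseline requires labeling roughly 44 profiles per astroturfer found, whereas Top 10 requires about 4, and unlike Past Selection it involves no manual pre-sifting.</p>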
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          :
          <article-title>New Media Campaigns and the Managed Citizen</article-title>
          . Cambridge University Press, New York, NY (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauchfleisch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sele</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caspar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Digital astroturfing in politics: Definition, typology, and countermeasures</article-title>
          .
          <source>Studies in Communication Sciences (1)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eoff</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caverlee</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Seven months with the devils: A long-term study of content polluters on Twitter</article-title>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mahbub</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pardede</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kayes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahayu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Controlling astroturfing on the internet: a survey on detection techniques and research challenges</article-title>
          .
          <source>International Journal of Web and Grid Services</source>
          <volume>15</volume>
          (
          <issue>2</issue>
          ),
          <fpage>139</fpage>
          –
          <lpage>158</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Stringhini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kruegel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vigna</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Detecting spammers on social networks</article-title>
          .
          <source>In: Proceedings of the 26th annual computer security applications conference</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>9</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>