
Utilizing Natural Honeypots for Efficiently Labeling Astroturfer Profiles

Jonathan Schler

schler@hit.ac.il 1

Elisheva Bonchek-Dokow

elishevabd@edu.aac.ac.il 0

Tomer Vainstein

Moshe Gotam

Mike Teplitsky

0 Ashkelon Academic College, Israel
1 Holon Institute of Technology, Israel

Astroturfing is the practice of using a fake online social media (OSM) profile in order to influence public opinion, while giving the impression that the profile belongs to an authentic human user. To train a classifier that discriminates between authentic users and astroturfers, a labeled dataset must first be assembled. The labeling is generally done manually, by human judges, on a collection of profiles gathered from the social media network. However, because any randomly collected set of profiles will statistically contain only a small proportion of astroturfers, this process is inefficient: a great deal of time and effort is invested in manually labeling large amounts of data, while producing only a small set of astroturfer profiles. We present here a method for quickly and efficiently collecting a dataset for manual labeling that contains a high percentage of astroturfers.

Astroturfing · Facebook · Efficient Labeling

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Along with the growing use of social networks, awareness of their dangers has also grown, albeit arguably at a slower pace. Adversarial use of online social media (OSM) profiles appears in various scenarios: social (cyberbullying), financial, and health-related (disseminating anti-vax pseudo-scientific claims). We focus here on the political scenario. The practice of attempting to create a false pretense of wide public support for a specific candidate in the political field, by using a fake OSM profile, is known as astroturfing. [1] defines astroturfing as the process of seeking electoral victory or legislative relief for grievances by helping political actors find and mobilize a sympathetic public, a process designed to create the image of public consensus where it does not necessarily exist. [2] analyzes the phenomenon of astroturfing as it appears in its digital form on OSM platforms, defining digital astroturfing as a form of manufactured, deceptive and strategic top-down activity on the Internet initiated by political actors that mimics bottom-up activity by autonomous individuals.

The phenomenon of fake profiles involved in political discussions on OSM has attracted the attention of several research groups. Some of these are funded by democratic governments, due to the growing awareness that such profiles pose a threat to the cornerstones of our democratic society, namely, trust and faith. A recent survey [4] on controlling astroturfing on the Internet refers to it as one of the most impactful threats online today.

A preliminary step in building an automatic classifier for identifying astroturfers, as in any machine learning problem, is creating a training set. The labeling itself is generally done manually, by human judges trained for the task. However, experience has taught us that the proportion of astroturfers found in any random collection of profiles is very low. Three judges employed to label a random set of profiles received detailed guidelines with criteria for labeling a profile as an astroturfer. When presented with a set of 400 randomly selected profiles, they labeled only 9 (2.25%) as astroturfers. Such a low figure creates an unbalanced dataset. This, in and of itself, could be tolerated: the algorithms can handle the imbalance satisfactorily, provided one starts out with a large enough dataset. However, the amount of time and effort expended on this task is unreasonable, rendering the process highly inefficient. What is needed is a method that yields a higher ratio of labeled astroturfers.

One approach to solving this problem is to harvest suspicious accounts by using what are known as "social honeypots" [3] [5]. These are fake profiles, set up by the research team with characteristics that are expected to lure the type of profiles being targeted. The profiles that political astroturfers are attracted to are naturally those that belong to political candidates. Obviously, setting up a honeypot in the form of a fake political candidate would not be a viable tactic. We introduce here a simple and straightforward technique, arising from insight gained over several months of collecting and analyzing profiles, posts and comments in the Israeli political Facebook scene.

Our dataset consists of close to half a million Facebook profiles collected over a span of 15 months, from close to four million comments on the posts of some twenty political candidates. During those 15 months, the political system in Israel went through one upheaval after another, with three rounds of elections attempting to reach an outcome that would enable the formation of a government. This rare situation created for us a rich dataset, with many novel attributes useful for identifying astroturfers.

Utilizing the Innate Honeypot Nature of Political Profiles

As mentioned, creating artificial honeypots in order to attract political astroturfers would not work. However, we came to realize that the existing profiles of the political candidates are in fact natural honeypots, in that they attract, by their very nature, the astroturfer profiles. This in itself is not enough, since authentic users are also attracted to these honeypots and, as mentioned, the ratio of labeled astroturfers to authentic users who interact with political profiles is very low. This is where one of the many features we studied proved useful for the task at hand. We noticed that astroturfers are quick to pounce on fresh posts, rendering the pool of the first several commenters rich with astroturfers. The method we suggest consists of collecting profiles that are among the first several commenters on the posts, and presenting these profiles to the human judges for labeling. We posited that the percentage of profiles labeled as astroturfers would be significantly higher than for a random collection of profiles. We present here data to support this thesis.

We described above the motivation for this study: only 2.25% of the training set was labeled as astroturfers, which is not much to show for the huge amount of effort that went into the task. Our first attempt was based on the realization that the proportion of comments astroturfers produce is much higher than their own proportion in the population. Therefore, instead of sampling the profiles, we sampled the comments, and chose the profiles that produced the sampled comments. This approach indeed brought forth a higher proportion of labeled astroturfers: 46 of 400 profiles (11.5%). This is better, but not enough.
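The comment-based sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the `(comment_id, profile_id)` schema, the function name, and the choice to skip profiles that were already drawn are all assumptions made for the example.

```python
import random


def sample_profiles_via_comments(comments, n_profiles, seed=0):
    """Pick profiles by drawing comments uniformly at random.

    `comments` is a list of (comment_id, profile_id) pairs; this schema is
    hypothetical. A profile that wrote many comments is proportionally more
    likely to be drawn early, which is the point of the approach.
    """
    rng = random.Random(seed)
    shuffled = list(comments)
    rng.shuffle(shuffled)

    chosen, seen = [], set()
    for _comment_id, profile_id in shuffled:
        if profile_id in seen:
            continue  # assumption: each profile is labeled only once
        seen.add(profile_id)
        chosen.append(profile_id)
        if len(chosen) == n_profiles:
            break
    return chosen


# Toy data: one heavy commenter with 50 comments, 50 users with one each.
toy_comments = [(i, "heavy") for i in range(50)]
toy_comments += [(50 + i, f"user{i}") for i in range(50)]
print(sample_profiles_via_comments(toy_comments, n_profiles=5))
```

In the toy data the heavy commenter is one of 51 profiles but wrote half of the comments, so it is far more likely to be selected than under uniform sampling of profiles.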

Our next attempt was to apply a preliminary manual sifting of profiles, creating two sets: one of suspected profiles and one of innocent-looking ones. These two sets were then presented to the judges for labeling, using the same guidelines. The preliminary sifting proved useful: of 364 profiles, 76 (20.88%) were labeled as astroturfers. This is much better, but it required extra effort for the preliminary sifting.

The most significant growth in the proportion of labeled astroturfers, however, was achieved by what we present here as our method: instead of choosing from all comments, we chose only from the top 10 comments. Apparently, the tendency to comment as quickly as possible is driven by astroturfers' high motivation to disseminate and promote their agenda immediately, whenever the opportunity presents itself in the form of a new post. This insight reveals the innate honeypot nature of political profiles: there is no need to create fake profiles in order to lure the astroturfers. These profiles already exist; they just need to be utilized in the right way. We tested this hypothesis by targeting the profiles responsible for the first 10 comments on 50 randomly chosen posts. The profiles were selected by randomly sampling with replacement from the pool of 500 comments (top 10 from 50 posts). It is worth noting that these 500 comments belonged to some 300 commenters, a fact which can be attributed to the nature of such quick-to-comment users, who also tend to comment more than once. Labeling this pool of profiles uncovered 25% of them as astroturfers.
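The selection procedure can be sketched as follows: collect the authors of the first K comments on each chosen post, then sample with replacement from that pool. The dictionary keys (`post_id`, `profile_id`, `created_at`), the function names, and the number of draws are hypothetical; the text specifies only K = 10, 50 posts, and sampling with replacement.

```python
import random


def top_k_comment_pool(comments, post_ids, k=10):
    """Return the authors of the first k comments on each selected post.

    `comments` is a list of dicts with the hypothetical keys 'post_id',
    'profile_id' and 'created_at' (any sortable timestamp). A profile
    appears once per qualifying comment, so frequent early commenters
    are over-represented in the pool.
    """
    by_post = {}
    for comment in comments:
        by_post.setdefault(comment["post_id"], []).append(comment)

    pool = []
    for post_id in post_ids:
        ordered = sorted(by_post.get(post_id, []), key=lambda c: c["created_at"])
        pool.extend(c["profile_id"] for c in ordered[:k])
    return pool


def sample_profiles_for_labeling(pool, n_draws, seed=0):
    """Sample with replacement from the pool, then drop repeated profiles."""
    rng = random.Random(seed)
    draws = rng.choices(pool, k=n_draws)  # with replacement
    return list(dict.fromkeys(draws))     # unique, in order of first draw
```

Because the pool keeps one entry per comment, a profile that appears among the first K comments on several posts is drawn more often, which matches the observation that the 500 comments came from only some 300 commenters.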

Figure 1 summarizes and compares all four methods: Random Baseline (where profiles were selected randomly from among all profiles), Chosen Posts (where profiles were selected by choosing randomly from comments rather than from profiles), Past Selection (after preliminary sifting, as done in our past research) and Top 10 (the proposed method of choosing from among the first 10 comments). Clearly, Top 10 has the highest ratio of labeled astroturfers. Not far behind is Past Selection; recall, however, that the manual labor invested in that result was far greater, so the advantage of Top 10 in astroturfers found per unit of effort is even more pronounced than the figure shows.

These are only preliminary results. The number of first comments we targeted in this study was 10, but this is no magic number. Further studies should compare results for different values of K, using the first K comments. It is important to stress again that the human annotators received training with detailed criteria and guidelines before they began labeling the data. These guidelines did not change across the various profile sets presented for labeling. In addition, it should be kept in mind that this method is pertinent only to the first stage of preparing training data for the classification algorithms. The correct choice of algorithms and parameters can then be found, applied and analyzed in order to create a high-performance classifier.

A large set of labeled data is a critical resource, and a method for obtaining such a set without wasting precious time and manual effort is valuable. We intend to make this repository of validated astroturfer profiles available for the use of the research community.
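To make the comparison concrete, the short script below converts the reported proportions into the approximate number of profiles a judge must label per astroturfer found. The counts are those given in the text; for Top 10 only the 25% rate is reported, so it is entered directly. The helper name is illustrative, and the calculation ignores the extra sifting effort behind Past Selection.

```python
# Proportion of profiles labeled as astroturfers, per selection method.
results = {
    "Random Baseline": 9 / 400,
    "Chosen Posts": 46 / 400,
    "Past Selection": 76 / 364,
    "Top 10": 0.25,  # only the rate is reported for this method
}


def profiles_per_astroturfer(rate):
    """Expected number of profiles labeled for each astroturfer found."""
    return 1 / rate


for name, rate in results.items():
    cost = profiles_per_astroturfer(rate)
    print(f"{name:16s} {rate:7.2%}  about {cost:.1f} profiles per astroturfer")
```

At the reported rates, the random baseline requires labeling roughly 44 profiles for every astroturfer found, whereas Top 10 requires 4.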

1. Howard, P.N.: New Media Campaigns and the Managed Citizen. Cambridge University Press, New York, NY (2005)

2. Kovic, M., Rauchfleisch, A., Sele, M., Caspar, C.: Digital astroturfing in politics: Definition, typology, and countermeasures. Studies in Communication Sciences (1) (2018)

3. Lee, K., Eoff, B.D., Caverlee, J.: Seven months with the devils: A long-term study of content polluters on Twitter (2011)

4. Mahbub, S., Pardede, E., Kayes, A., Rahayu, W.: Controlling astroturfing on the internet: a survey on detection techniques and research challenges. International Journal of Web and Grid Services 15(2), 139–158 (2019)

5. Stringhini, G., Kruegel, C., Vigna, G.: Detecting spammers on social networks. In: Proceedings of the 26th Annual Computer Security Applications Conference. pp. 1–9 (2010)