<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing community-changing users using text mining and graph machine learning on Twitter</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Albanese</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Esteban Feuerstein</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leandro Lombardi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Balenzuela</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Ciencias de la Computación, CONICET- Universidad de Buenos Aires</institution>
          ,
          <addr-line>Buenos Aires</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Física de Buenos Aires (IFIBA)</institution>
          ,
          <addr-line>CONICET</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Medallia</institution>
          ,
          <addr-line>Buenos Aires</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Even though the Internet and social media have increased the amount of news and information people can consume, most users are only exposed to content that reinforces their positions and isolates them from other ideological communities. This environment has real consequences with great impact on our lives, such as severe political polarization, easy spread of fake news, political extremism, hate groups and the lack of enriching debates, among others. Therefore, encouraging conversations between different groups of users and breaking up closed communities is important for healthy societies. In this paper, we characterize and study users who change their community on Twitter using natural language processing techniques and graph machine learning algorithms. In particular, we collected 9 million Twitter messages from 1.5 million users and constructed retweet networks. We identified their communities and the topics of discussion associated with them. With this data, we present a machine learning framework for social media user classification which detects users that swing from their closed community to another one. A feature importance analysis on three polarized political Twitter datasets showed that these users have low values of PageRank, suggesting that changes in community are driven by the fact that their messages have no resonance in their original communities.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Media</kwd>
        <kwd>text mining</kwd>
        <kwd>graph learning</kwd>
        <kwd>communities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        People with different political opinions and diverse backgrounds interact on social networks.
However, this diversity does not translate into enriching debates between users with different
profiles because they tend to cluster according to their beliefs, forming homogeneous
communities known as echo chambers [16]. Aruguete et al. focused on the interaction between
users in political contexts and described how Twitter users frame political events by sharing
content exclusively with like-minded users, forming two well-defined communities [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A
segregated partisan structure with extremely limited connections between communities of users
with different political orientations in the retweet networks has been reported in multiple papers, in
different contexts and countries, for instance, the 2010 U.S. congressional midterm elections
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the 2011 Canadian Federal Election [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] or tweets about the death of Venezuelan President
Hugo Chavez [20]. Well-defined communities can also be found on different platforms [
        <xref ref-type="bibr" rid="ref10">10, 25</xref>
        ].
      </p>
      <p>
        Previous works showed the dramatic consequences and negative effects of closed communities
and echo chambers, which include the increase of negative discourse, hate speech and political
extremism [19], confirmation bias (i.e. the users' tendency to seek out and receive information
that strengthens their preferred narrative) [25] and the spread of baseless rumors and fake news
[
        <xref ref-type="bibr" rid="ref12 ref9">12, 9</xref>
        ].
      </p>
      <p>In this paper, we propose a machine learning framework to characterize the users
who break this logic and change who they interact with: the community-changing users (i.e.
the Twitter users that first belonged to a well-defined community and then started interacting
mostly with different users, swinging to another community). Analyzing users that switch their
political community can offer valuable insights into the complex dynamics of electoral politics,
as they may be the deciding factor in which party wins an election.</p>
      <p>Three datasets were built and used in order to show that the methodology can be easily
generalized to different scenarios. Namely, we examined three Twitter network datasets
constructed with tweets from the 2017 Argentina parliamentary elections, the 2019 Argentina presidential
elections and 2020 tweets about Donald Trump. For each dataset, we analyzed two different
time periods and identified the largest communities, corresponding to the main political forces.
Using graph topological information and the topics of discussion detected in the first network, we
built and trained a model that classifies whether an individual will change their community
and identified relevant features of the community-changing users.</p>
      <p>Our main contributions are the following:</p>
      <p>1. We describe a generalized machine learning framework for social media user
classification, in particular, to detect and characterize community-changing users. This framework
includes natural language processing techniques and graph machine learning algorithms
in order to describe the topics of interest and interactions of each individual.</p>
      <p>2. We experimentally analyze the machine learning framework by performing a feature
importance analysis. In particular, we establish the importance of a low value of the “PageRank” [23]
measure for this specific task. An interpretation of this result is that a person changes
their community because their message was not heard in their previous community.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data Collection</title>
      <p>Twitter has several APIs available for developers. Among them is the Streaming API, which allows
the developer to download in real time a sample of the tweets that are uploaded to the social network,
filtering them by language, terms, hashtags, etc. [21]. The data is composed of the tweet id, the
text, the date and time of the tweet, the user id and username, among other features. In the case of
a retweet, it also has the information of the original tweet’s user account.</p>
      <p>For this research, we collected three datasets, each in two different periods of time: 2017 Argentina
parliamentary elections (2017ARG), 2019 Argentina presidential elections (2019ARG) and 2020
United States tweets of Donald Trump (2020US). For the Argentinian datasets, the Streaming
API was used during the week preceding the primary elections (from Aug 7th to Aug 13th
2017 and from Aug 5th to Aug 12th 2019) and the week before the general elections (from Oct
15th to Oct 20th 2017 and from Oct 20th to Oct 27th 2019). Keywords were chosen according
to the four main political parties present in the elections. For the 2020US dataset, we used
“realDonaldTrump” (the official account of President Donald Trump) as the keyword and the weeks
from May 9th to May 16th and from June 10th to June 16th of 2020 as the first and second time
periods respectively. Details can be found in the appendix. We analyzed more than 9 million
tweets and more than 1.5 million individuals in total.</p>
      <p>Ethical Considerations and Data Availability: The datasets were constructed entirely with
publicly available data, as we do not collect any data from private accounts. For reproducibility,
we also make the IDs publicly available.</p>
      <p>https://github.com/fedealbanese/community-changing-users/</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we will present the methodology employed to characterize the users. We describe
how we calculate each feature and implement a supervised model that classifies users who
changed their community over time. These models allow us to highlight which features are
relevant characteristics of the users.</p>
      <sec id="sec-3-1">
        <title>3.1. The retweet network</title>
        <p>
          We represent the interaction among individuals as a graph G = (N, E), where users are
nodes (N) and retweets between them are edges (E). Considering that a user can be retweeted
multiple times by another user, this is well modeled by a directed and weighted graph. However,
when a user u<sub>1</sub> retweets a tweet written by another user u<sub>2</sub>, should the edge point from u<sub>1</sub>
to u<sub>2</sub> or from u<sub>2</sub> to u<sub>1</sub>? This choice has important implications. In the first scenario, the
edges represent pointers to the “influencers” and important content generators. In the second
scenario, the edges represent the flow of information through the network, going from the
source to the user who spread the message. Indeed, there is no clear consensus in the scientific
literature about which direction should be given to the edges: while some authors [
          <xref ref-type="bibr" rid="ref15 ref3">3, 17, 31</xref>
          ]
use the first, others [
          <xref ref-type="bibr" rid="ref11">27, 20, 11</xref>
          ] prefer the second one. A priori, we cannot tell which direction
is better for our purpose, so we decided to calculate the topological features in both scenarios.
We named the directions of the edges RC (from Retweeter to content Creator) and CR (from
content Creator to Retweeter).
        </p>
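        <p>As an illustration of the two conventions, the following minimal sketch builds both weighted edge sets from a list of retweets. The function name build_retweet_edges, the (retweeter, creator) input format, the toy user names and the choice to skip self-retweets are assumptions made for this example and are not necessarily what was used in this work.</p>
        <preformat>
```python
from collections import Counter

def build_retweet_edges(retweets):
    """Build weighted directed edges for both conventions.

    `retweets` is an iterable of (retweeter, creator) pairs, one per retweet.
    Returns two Counters mapping (source, target) to the number of retweets.
    """
    rc = Counter()  # RC: edge from Retweeter to content Creator
    cr = Counter()  # CR: edge from content Creator to Retweeter
    for retweeter, creator in retweets:
        if retweeter == creator:
            continue  # assumption: self-retweets carry no interaction signal
        rc[(retweeter, creator)] += 1
        cr[(creator, retweeter)] += 1
    return rc, cr

# Toy data with hypothetical user names.
retweets = [("ana", "bob"), ("ana", "bob"), ("carl", "bob"), ("bob", "ana")]
rc_edges, cr_edges = build_retweet_edges(retweets)
```
        </preformat>
        <p>In this toy example the edge ("ana", "bob") has weight 2 in the RC network, and the same weight appears under the reversed key ("bob", "ana") in the CR network.</p>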
        <p>Fig. 1 shows the retweet network for each time period and dataset. In the case
of the US dataset, most of the users are concentrated in two groups, portraying the political
polarization in that country. In the Argentinean datasets, on the other hand, we can identify two
large groups and also some smaller ones. The graph visualizations were produced with the ForceAtlas2
layout using the Gephi software [15].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Unsupervised Learning: Community Detection</title>
        <p>
          In a given graph, a community is a set of nodes that are strongly connected to each other and have little
or no connection with nodes of other communities [
          <xref ref-type="bibr" rid="ref16">32</xref>
          ]. We detect the communities in the
retweet network for each dataset using the Louvain method [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Given its stochasticity, we
follow the solution proposed by Lancichinetti et al. [18], which runs the method several times
(100 in our case). Only the nodes that were assigned to the same
community in all runs were considered in this work, in order to minimize the possibility
of incorrect labeling. We also consider only the users that received or made more than 5
retweets in each time period.
        </p>
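        <p>The filtering step can be sketched as follows. This is a simplified illustration and not the full consensus clustering procedure of [18]: it assumes that the partitions of the repeated runs are already available as dictionaries (for instance from any Louvain implementation), aligns the arbitrary labels of each run to the first run by majority overlap, and keeps only the nodes whose aligned community never changes. The function name stable_nodes and the toy runs are hypothetical.</p>
        <preformat>
```python
from collections import Counter, defaultdict

def stable_nodes(partitions):
    """Return {node: community} for nodes whose community never changes.

    `partitions` is a list of dicts (node -> label), one per run of a
    stochastic community detection method. Labels are arbitrary per run, so
    each run is first aligned to the first run by majority overlap.
    """
    reference = partitions[0]
    stable = dict(reference)
    for run in partitions[1:]:
        # Count how the labels of this run overlap with the reference labels.
        overlap = defaultdict(Counter)
        for node, label in run.items():
            if node in reference:
                overlap[label][reference[node]] += 1
        mapping = {label: counts.most_common(1)[0][0]
                   for label, counts in overlap.items()}
        # Drop nodes that land in a different (aligned) community in this run.
        for node in list(stable):
            if mapping.get(run.get(node)) != stable[node]:
                del stable[node]
    return stable

runs = [
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1},
    {"a": 5, "b": 5, "c": 5, "d": 7, "e": 7, "f": 7},  # same split, new labels
    {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0, "f": 0},  # "c" switches sides
]
consensus = stable_nodes(runs)
```
        </preformat>
        <p>In the toy runs, node "c" is grouped with a different set of nodes in the third run, so it is discarded, while the remaining five nodes keep their community from the first run.</p>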
        <p>
          Although the algorithm found several communities, we considered only the 4
largest ones for the Argentinean datasets and the 2 largest ones for the US dataset, since these
contain more than 90% of the users. We examined the text of the tweets and the users with
the highest degree in each of the biggest communities and found that each one had a clear
political orientation corresponding to the four biggest political parties in the election
(“Cambiemos”, “Unidad Ciudadana”, “Partido Justicialista” and “1 Pais” for 2017ARG and “Frente
de Todos”, “Juntos por el Cambio”, “Consenso Federal” and “Frente de Izquierda-Unidad” for
2019ARG). Regarding the 2020US dataset, the 2 biggest communities corresponded to
Republican and Democratic accounts. The United States has a two-party political system, which can be
seen in Fig. 1, where only two big clusters concentrate almost all of the users and interactions.
In contrast, the Argentinean datasets have two principal communities and some minor
communities as well. This network topology with highly connected and polarized clusters has been
reported in previous works [
          <xref ref-type="bibr" rid="ref11 ref4">4, 11, 29</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Graph Features</title>
        <p>Given that the analyzed datasets comprise two snapshots of the retweet network separated in
time, we need to fully characterize the users in the early networks in order to properly identify
those users that change their community. With this goal, we computed the following metrics
for each user in the network: degree, in-degree, out-degree, PageRank, betweenness centrality,
clustering coefficient and cluster affiliation (the detected community). As mentioned earlier,
the direction of the edges of the network drastically affects the
value of these metrics. Consequently, we calculated them under both interpretations (RC and CR). All these
metrics were used as features in the machine learning classification task and the feature importance
analysis.</p>
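        <p>To make the dependence on edge direction concrete, the sketch below computes a weighted PageRank by power iteration on the same toy retweets under both conventions. It is a minimal illustration under stated assumptions: the damping factor of 0.85 is the conventional default and not a value reported in this paper, and the function name, iteration count and toy users are hypothetical. In practice a graph library would normally be used instead.</p>
        <preformat>
```python
def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    `edges` maps (source, target) to a positive weight. Rank flows along the
    edge direction, so the result depends on the RC or CR convention.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {n: 0.0 for n in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Nodes without outgoing edges spread their rank uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank

# Toy network: "bob" writes, "ana" and "carl" retweet him (hypothetical users).
rc = {("ana", "bob"): 2, ("carl", "bob"): 1}   # Retweeter -> content Creator
cr = {(t, s): w for (s, t), w in rc.items()}   # content Creator -> Retweeter
pr_rc, pr_cr = pagerank(rc), pagerank(cr)
```
        </preformat>
        <p>The two conventions rank the same three users differently, which is why each topological feature is computed once per direction.</p>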
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Natural Language Processing Features</title>
        <p>
          The features described above are based on the interaction of users and arise from the topology
of the retweet network. We also characterize the topics of discussion during the first period of
each dataset by analyzing the texts of the tweets. As in previous works [
          <xref ref-type="bibr" rid="ref1">1, 24</xref>
          ], the tweets were first
represented as vectors using the Term Frequency - Inverse Document Frequency (tf-idf)
representation [26]. We used 3-grams and a modified stop-word dictionary that contained not only
articles, prepositions, pronouns and some verbs but also the names of the politicians and
parties and words like “election”. Then, we applied Non-Negative Matrix Factorization (NMF)
[30] to cluster our corpus of texts into topics. Finally, each user was characterized by a vector
in which each cell corresponds to one of the topics and its value is the percentage of the
user's tweets on that topic.
        </p>
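        <p>The last step, turning per-tweet topics into a per-user vector, can be sketched as below. The sketch assumes that every tweet has already been assigned a single topic index (for example the dominant topic of its NMF row), which is an assumption of this example; the function name user_topic_vectors and the toy data are hypothetical.</p>
        <preformat>
```python
from collections import Counter, defaultdict

def user_topic_vectors(tweet_topics, n_topics):
    """Map each user to the percentage of their tweets in each topic.

    `tweet_topics` is an iterable of (user, topic_index) pairs, one per tweet.
    """
    counts = defaultdict(Counter)
    for user, topic in tweet_topics:
        counts[user][topic] += 1
    vectors = {}
    for user, topic_counts in counts.items():
        total = sum(topic_counts.values())
        vectors[user] = [100.0 * topic_counts[t] / total for t in range(n_topics)]
    return vectors

# Toy data: "ana" has two tweets on topic 0 and one on topic 2.
tweets = [("ana", 0), ("ana", 0), ("ana", 2), ("bob", 1)]
vectors = user_topic_vectors(tweets, n_topics=3)
```
        </preformat>
        <p>Each vector sums to 100, so users with very different activity levels remain comparable.</p>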
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Feature importance analysis</title>
        <p>
          Given that our objective was to characterize users who change their community and start
interacting with users from other clusters, we implemented a machine learning model which
classifies users and then performed a feature importance analysis. The instances of the model
were the Twitter users who were active during both time periods [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and belonged to one of
the biggest communities in the networks of both time periods. Consequently, the number of users
considered at this stage was reduced. Individuals were characterized by a feature vector with
components corresponding to the topological metrics mentioned above and others corresponding to
the percentage of tweets in each one of the topics. The information used to construct these
feature vectors was gathered only from the first time period, to avoid data leakage. The target
was a binary vector that takes the value 1 if the user changed communities between the first
and the second time periods and 0 otherwise. A summary of the datasets is shown in Table 1.
        </p>
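        <p>The construction of the target can be sketched as follows. The sketch assumes that community labels are comparable across the two periods (for instance, each community is named after its political party), which is an assumption of this example; the function name build_targets and the toy users are hypothetical.</p>
        <preformat>
```python
def build_targets(first_period, second_period, main_communities):
    """Label users by whether they changed community between two periods.

    Both period arguments map user -> community. Only users present in both
    periods and assigned to a main community in both are kept.
    """
    targets = {}
    for user, before in first_period.items():
        after = second_period.get(user)
        if before in main_communities and after in main_communities:
            targets[user] = int(before != after)
    return targets

first = {"ana": "A", "bob": "A", "carl": "B", "dana": "C"}
second = {"ana": "A", "bob": "B", "dana": "A", "eva": "B"}
targets = build_targets(first, second, main_communities={"A", "B"})
```
        </preformat>
        <p>Here "carl" and "eva" are dropped because they are active in only one period, and "dana" is dropped because her first community is not one of the main ones.</p>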
        <p>
          We apply the gradient boosting technique XGBoost [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which uses an ensemble of predictive
models and has proven to be efficient in a wide variety of supervised scenarios, outperforming
previous models [22]. We use a 67/33 random split between train and test. In order to do
hyper-parameter tuning of the XGBoost models, we use the randomized search method [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] over
the training dataset with 3-fold cross-validation.
        </p>
        <p>
          Finally, we performed a random permutation of the feature values among users in order to
understand which of them are the most important for the performance of our model (using the
so-called Permutation Feature Importance algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]). In this way, we could identify the
most important characteristics that separate the users that do change their community from
those that do not change who they interact with.
        </p>
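        <p>The idea behind permutation importance can be sketched in a few lines. This is a generic illustration and not the exact procedure or scale used in our experiments: the stand-in model, the accuracy metric, the number of repeats and the toy data are all assumptions made for the example.</p>
        <preformat>
```python
import random

def permutation_importance(predict, rows, labels, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` when each feature column is shuffled.

    `predict` maps a list of feature rows to predictions; `metric` takes
    (labels, predictions) and returns a score where higher is better.
    """
    rng = random.Random(seed)
    baseline = metric(labels, predict(rows))
    importances = []
    for column in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[column] for row in rows]
            rng.shuffle(shuffled)
            permuted = [row[:column] + [value] + row[column + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - metric(labels, predict(permuted)))
        importances.append(sum(drops) / n_repeats)
    return importances

def accuracy(labels, predictions):
    return sum(int(a == b) for a, b in zip(labels, predictions)) / len(labels)

# Hypothetical model: predicts a community change when feature 0 is small.
def predict(rows):
    return [int(0.5 > row[0]) for row in rows]

rows = [[0.1, 3.0], [0.2, 1.0], [0.8, 2.0], [0.9, 4.0], [0.3, 5.0], [0.7, 0.0]]
labels = [1, 1, 0, 0, 1, 0]
importances = permutation_importance(predict, rows, labels, accuracy)
```
        </preformat>
        <p>Shuffling the second feature leaves the score unchanged because the stand-in model ignores it, while shuffling the first feature degrades the score, which is the signal the importance analysis relies on.</p>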
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We trained three different gradient boosting models for each dataset: the first one was trained
only with the features obtained via text mining (the percentage of the user's tweets in each of the selected
topics); the second one was trained only with the features obtained through complex network
analysis (degree, PageRank, betweenness centrality, clustering coefficient and cluster affiliation);
and the last one was trained with all the features. In this way, we could compare the importance
of natural language processing and complex network analysis for the task of classifying
community-changing users.</p>
      <p>Table 2 shows the area under the ROC (receiver operating characteristic) curve (AUC) [28]
of the different models for each dataset. The best performance is obtained in all cases by the
machine learning model built with all the features of the users, which classifies the users who
changed their community more accurately. This result is expected, since an ensemble of
models has sufficient depth and robustness to exploit the network information,
the topics of the tweets and the graph characteristics of the users. Also, the model trained with
graph features outperformed the model with only text features in all three cases.</p>
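      <p>For reference, the area under the ROC curve can be read as the probability that a randomly chosen community-changing user receives a higher score than a randomly chosen user who stayed. The sketch below computes it directly from that definition; the function name roc_auc and the toy labels and scores are hypothetical and are not results from this paper.</p>
      <preformat>
```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```
      </preformat>
      <p>A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two groups.</p>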
      <p>We performed a random permutation of the feature values among users for the model trained
with all features (text+graph). We found that the most important feature in all cases
corresponds to the node’s connectivity: PageRank<sub>CR</sub>, where the edges point from the tweet source
(the content creator) to the user who retweeted. The feature importance coefficients of
PageRank<sub>CR</sub> are 1635 (2017ARG), 2836 (2019ARG) and 843 (2020US). All other features
display lower coefficients. In particular, the other PageRank, PageRank<sub>RC</sub> (corresponding to the
other direction of the edges), had feature importance coefficients of 717, 1202 and 527 for each
dataset respectively (a reduction of roughly 56%, 58% and 37%). This means that there is a clear privileged
direction of edges for the task of detecting the users who changed their community.</p>
      <p>
        When comparing the PageRank<sub>CR</sub> (PR) averages of these users with those of the users that did not
change their community, we observed that the latter had higher values in all cases (Table 3). We
applied the Kolmogorov-Smirnov test [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to the PR distributions of each set and found that these
differences were statistically significant in all cases (p &lt; 0.001). PageRank measures how
relevant or important a user is in the retweet network based on the retweets of their messages
and the importance of the users who retweeted them. The direction of PageRank<sub>CR</sub> represents the
information flow in the network, starting from the tweet creator and then spreading through the
network. The fact that the community-changing users had statistically lower PageRank<sub>CR</sub>
values means that these users were less relevant to the Twitter conversation and their messages
did not spread in their original community. A possible interpretation of these results is that a
user changes community when they do not have strong affinities with their community and
their messages get no response.
      </p>
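      <p>The statistic behind the two-sample Kolmogorov-Smirnov test is the largest gap between the empirical cumulative distribution functions of the two groups. The sketch below computes only that statistic, not the p-value (a statistics library such as SciPy, through scipy.stats.ks_2samp, would normally be used for the full test). The function name ks_statistic and the PageRank values are hypothetical and are not taken from Table 3.</p>
      <preformat>
```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the
    empirical cumulative distribution functions of the two samples."""
    def ecdf(sample, x):
        # Fraction of the sample that is not greater than x.
        return sum(1 for value in sample if x >= value) / len(sample)
    points = list(sample_a) + list(sample_b)
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in points)

# Hypothetical PageRank values, not taken from the paper.
changers = [0.01, 0.02, 0.02, 0.03]
stayers = [0.04, 0.05, 0.06, 0.08]
statistic = ks_statistic(changers, stayers)
```
      </preformat>
      <p>In this toy case the two samples do not overlap at all, so the statistic reaches its maximum value of 1; identical samples would give 0.</p>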
      <p>The fact that PageRank<sub>CR</sub> is the most important feature is also consistent with the
model trained with network features obtaining a better AUC than the model trained with the
texts of the tweets in the three datasets.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper we presented a machine learning framework to identify and
characterize users who changed their community for another one. The framework includes
natural language processing techniques to detect their topics of interest and graph machine
learning algorithms to describe how an individual interacts with other users. The
framework was applied to three different datasets with similar results, showing that the methodology
can be easily generalized.</p>
      <p>We found that the users who changed communities had statistically lower values of PageRank<sub>CR</sub>.
This graph feature was also the most important indicator for the classification task in all three
datasets according to the feature importance analysis. In particular, our results also show that
there is a clearly privileged direction on the network for this task, with the edges going from
the content creator to the retweeter. A possible interpretation of these last two results is that
users change who they interact with when they do not have strong affinities with other users,
their messages get no response and they are not being “heard” by their community.</p>
      <p>[15] Jacomy, M., Venturini, T., Heymann, S., Bastian, M.: Forceatlas2, a continuous graph layout
algorithm for handy network visualization designed for the gephi software. PloS one 9(6),
e98679 (2014)</p>
      <p>[16] Jamieson, K.H., Cappella, J.N.: Echo chamber: Rush Limbaugh and the conservative media
establishment. Oxford University Press (2008)</p>
      <p>[17] Kogan, M., Palen, L., Anderson, K.M.: Think local, retweet global: Retweeting by the
geographically-vulnerable during hurricane sandy. In: Proceedings of the 18th ACM
conference on computer supported cooperative work &amp; social computing. pp. 981–993
(2015)</p>
      <p>[18] Lancichinetti, A., Fortunato, S.: Consensus clustering in complex networks. Scientific
reports 2(1), 1–7 (2012)</p>
      <p>[19] Lima, L., Reis, J.C., Melo, P., Murai, F., Araujo, L., Vikatos, P., Benevenuto, F.: Inside the
right-leaning echo chambers: Characterizing gab, an unmoderated social system. In: 2018
IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining
(ASONAM). pp. 515–522. IEEE (2018)</p>
      <p>[20] Morales, A.J., Borondo, J., Losada, J.C., Benito, R.M.: Measuring political polarization:
Twitter shows the two sides of venezuela. Chaos: An Interdisciplinary Journal of Nonlinear
Science 25(3), 033114 (2015)</p>
      <p>[21] Morstatter, F., Pfeffer, J., Liu, H., Carley, K.M.: Is the sample good enough? comparing
data from twitter’s streaming api with twitter’s firehose. In: Seventh international AAAI
conference on weblogs and social media (2013)</p>
      <p>[22] Nielsen, D.: Tree boosting with xgboost-why does xgboost win "every" machine learning
competition? Master’s thesis, NTNU (2016)</p>
      <p>[23] Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: Bringing
order to the web. Tech. rep., Stanford InfoLab (1999)</p>
      <p>[24] Pinto, S., Albanese, F., Dorso, C.O., Balenzuela, P.: Quantifying time-dependent media
agenda and public opinion by topic modeling. Physica A: Statistical Mechanics and its
Applications 524, 614–624 (2019)</p>
      <p>[25] Quattrociocchi, W., Scala, A., Sunstein, C.R.: Echo chambers on facebook. Available at
SSRN 2795110 (2016)</p>
      <p>[26] Ramos, J., et al.: Using tf-idf to determine word relevance in document queries. In:
Proceedings of the first instructional conference on machine learning. vol. 242, pp. 29–48.
Citeseer (2003)</p>
      <p>[27] Rath, B., Gao, W., Ma, J., Srivastava, J.: From retweet to believability: Utilizing trust to
identify rumor spreaders on twitter. In: Proceedings of the 2017 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining 2017. pp. 179–186 (2017)</p>
      <p>[28] Rice, M.E., Harris, G.T.: Comparing effect sizes in follow-up studies: Roc area, cohen’s d,
and r. Law and human behavior 29(5), 615–620 (2005)</p>
      <p>[29] Stewart, L.G., Arif, A., Starbird, K.: Examining trolls and polarization with a retweet
network. In: Proc. ACM WSDM, workshop on misinformation and misbehavior mining on
the web. vol. 70 (2018)</p>
      <p>[30] Xu, W., Liu, X., Gong, Y.: Document clustering based on non-negative matrix factorization.
In: Proceedings of the 26th annual international ACM SIGIR conference on Research and
development in information retrieval. pp. 267–273 (2003)</p>
      <p>[31] Yang, M.C., Lee, J.T., Lee, S.W., Rim, H.C.: Finding interesting posts in twitter based on
retweet graph analysis. In: Proceedings of the 35th international ACM SIGIR conference
on Research and development in information retrieval. pp. 1073–1074 (2012)</p>
      <p>[32] Yang, Z., Algesheimer, R., Tessone, C.J.: A comparative analysis of community detection
algorithms on artificial networks. Scientific reports 6(1), 1–18 (2016)</p>
      <p>
Appendix. This section lists the keywords used for collecting the tweets.</p>
      <p>2017 Argentina parliamentary elections:
The tweets were restricted to Spanish and the following terms were chosen as keywords
for Twitter: the names and official Twitter users of the Senate candidates of the four main parties
(i.e., “SergioMassa”, “Massa”, “RandazzoF”, “Randazzo”, “estebanbullrich”, “Bullrich”,
“CFKArgentina”, “CFK” and “Kirchner”); the names of the official accounts of the first candidates
for deputies of the parties (i.e., “felipe_sola”, “BuccaBali”, “gracielaocana” and “fvallejoss”);
the names of the official accounts of the political parties on Twitter (i.e., “1PaisUnido”, “1Pais”,
“FJCumplir”, “Frente Justicialista”, “cambiemos”, “UniCiudadanaAR” and “Unidad Ciudadana”);
and the President of Argentina and the governor of the province of Buenos Aires at the time of the
elections (i.e., “mauriciomacri”, “Macri” and “mariuvidal”).</p>
      <p>2019 Argentina presidential election:
The tweets were restricted to Spanish and the following terms were chosen as keywords for
Twitter: “Elisacarrio”, “OfeFernandez_”, “PatoBullrich”, “macri”, “macrismo”, “mauriciomacri”,
“pichetto”, “MiguelPichetto”, “JuntosPorElCambio”, “alferdez”, “CFKArgentina”, “CFK”,
“kirchner”, “kirchnerismo”, “FrenteTodos”, “FrenteDeTodos”, “Lavagna”, “RLavagna”, “Urtubey”,
“UrtubeyJM”, “ConsensoFederal”, “2030ConsensoFederal”, “DelCaño”, “NicolasdelCano”, “DelPla”,
“RominaDelPla”, “FitUnidad”, “FdeIzquierda”, “Fte_Izquierda”, “Castañeira”, “ManuelaC22”,
“Mulhall”, “NuevoMas”, “Espert”, “jlespert”, “FrenteDespertar”, “Centurion”, “juanjomalvinas”,
“Hotton”, “CynthiaHotton”, “Biondini”, “Venturino”, “FrentePatriota”, “RomeroFeris”,
“PartidoAutonomistaNacional”, “Vidal”, “mariuvidal”, “Kicillof”, “Kicillofok”, “Bucca”, “BuccaBali”,
“chipicastillo”, “Larreta”, “horaciorlarreta”, “Lammens”, “MatiasLammens”, “Tombolini”,
“matiastombolini”, “Solano”, “Solanopo”, “Lousteau”, “GugaLusto”, “Recalde”, “marianorecalde”,
“RAMIROMARRA”, “Maxiferraro”, “fernandosolanas”, “MarcoLavagna”, “myriambregman”,
“cristianritondo”, “Massa”, “SergioMassa”, “GracielaCamano”, “nestorpitrola”.</p>
      <p>2020 tweets of Donald Trump:
The following term was used as a keyword for the Twitter API: “realDonaldTrump”. In addition,
the tweets were restricted to English.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Albanese</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeshenko</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balenzuela</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Analyzing mass media influence using natural language processing and time series analysis</article-title>
          .
          <source>Journal of Physics: Complexity</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <elocation-id>025005</elocation-id>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Altmann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toloşi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sander</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lengauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Permutation importance: a corrected feature importance measure</article-title>
          .
          <source>Bioinformatics</source>
          <volume>26</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1340</fpage>
          -
          <lpage>1347</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Amati</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angelini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Capri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gambosi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vocca</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On the retweet decay of the evolutionary retweet graph</article-title>
          .
          In:
          <source>International Conference on Smart Objects and Technologies for Social Good</source>
          . pp.
          <fpage>243</fpage>
          -
          <lpage>253</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Aruguete</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Time to #protest: Selective exposure, cascading activation, and framing in social media</article-title>
          .
          <source>Journal of Communication</source>
          <volume>68</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>480</fpage>
          -
          <lpage>502</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Bergstra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Random search for hyper-parameter optimization</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>13</volume>
          (
          <issue>2</issue>
          )
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>V.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guillaume</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lambiotte</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefebvre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Fast unfolding of communities in large networks</article-title>
          .
          <source>Journal of Statistical Mechanics: Theory and Experiment</source>
          <volume>2008</volume>
          (
          <issue>10</issue>
          ),
          <elocation-id>P10008</elocation-id>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cazabet</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossetti</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Challenges in community discovery on temporal networks</article-title>
          .
          In:
          <source>Temporal Network Theory</source>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>197</lpage>
          . Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guestrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>XGBoost: A scalable tree boosting system</article-title>
          .
          In:
          <source>Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Choi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chun</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Rumor propagation is amplified by echo chambers in social media</article-title>
          .
          <source>Scientific Reports</source>
          <volume>10</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Cinelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morales</surname>
            ,
            <given-names>G.D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galeazzi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quattrociocchi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Starnini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The echo chamber effect on social media</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>118</volume>
          (
          <issue>9</issue>
          ) (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Conover</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratkiewicz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Francisco</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Menczer</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Flammini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Political polarization on Twitter</article-title>
          .
          In:
          <source>Fifth International AAAI Conference on Weblogs and Social Media</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Del Vicario</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bessi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zollo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petroni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scala</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caldarelli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanley</surname>
            ,
            <given-names>H.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quattrociocchi</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>The spreading of misinformation online</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>113</volume>
          (
          <issue>3</issue>
          ),
          <fpage>554</fpage>
          -
          <lpage>559</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Gruzd</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Investigating political polarization on Twitter: A Canadian perspective</article-title>
          .
          <source>Policy &amp; Internet</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>28</fpage>
          -
          <lpage>45</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Hodges</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>The significance probability of the Smirnov two-sample test</article-title>
          .
          <source>Arkiv för Matematik</source>
          <volume>3</volume>
          (
          <issue>5</issue>
          ),
          <fpage>469</fpage>
          -
          <lpage>486</lpage>
          (
          <year>1958</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>J.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rim</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          :
          <article-title>Finding interesting posts in Twitter based on retweet graph analysis</article-title>
          .
          In:
          <source>Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>1073</fpage>
          -
          <lpage>1074</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Algesheimer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tessone</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>A comparative analysis of community detection algorithms on artificial networks</article-title>
          .
          <source>Scientific Reports</source>
          <volume>6</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>