<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Four Feature Types Approach for Detecting Bot and Gender of Twitter Users</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Swedish Defence Research Agency</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The main idea of our classification model used in the PAN Bot and Gender profiling task 2019 was to combine different feature types with the ambition to detect different styles in writing in order to distinguish bots, females and males from each other. We included both word and character TF-IDF features together with compression and tweet features. As classification algorithm we used the CatBoost method. We trained two models, one for the English data and one for the data in Spanish. We achieved the highest accuracy with our English model. For both languages, the models were more accurate at distinguishing bots from humans than at distinguishing females from males.</p>
      </abstract>
      <kwd-group>
        <kwd>Bot detection</kwd>
        <kwd>Gender profiling</kwd>
        <kwd>Twitter</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For several years, bots have been used for a large variety of purposes. Initially
their purpose was to automate otherwise unwieldy online processes which could
not be done manually, but they are now commonly known for being
used for commercial purposes such as directing Internet users to advertisements
and posting spam in different social media channels. Bots are also often used to
further illegal activity such as collecting data from users for criminal gain. Bot
detection is therefore important for a variety of security purposes. Bot
detection has for example been used when monitoring large events such as elections,
with the aim of preventing influence operations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Gender profiling from text is
an important step in author profiling and can also be used for marketing and
commercial purposes.
      </p>
      <p>
        In this notebook, we present the steps necessary for reproducing the
model we used in the PAN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] 2019 Bot and Gender profiling [13] task. We also
briefly describe what we hope to capture with the different types of features.
The concept of the model is illustrated in figure 1.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Previous work</title>
      <p>
        There is a lot of research on bot detection, such as [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], where a large variety of
features which have been seen to perform well in previous research are combined.
In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], two types of features are used: meta-features and tweet features. In [15] a
total of 1,150 different features are used to train a supervised machine learning
model to detect bots. For example, the features consist of part-of-speech (POS) tags,
time features such as statistics of the times between consecutive tweets, retweets,
and mentions, and the entropy of words in a tweet. We took these feature types
into consideration while developing our own model. Gender classification from
text is a well-researched problem. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], it is clear that POS tags are an important
type of feature for gender classification.
      </p>
      <sec id="sec-2-1">
        <title>Model</title>
        <p>Both training and test data for the task were stored in XML format. Every tweet
for every user was preprocessed by removing all tab markers and quotation
marks (").</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Features</title>
      <p>We calculated each feature type for the training and test users and then
concatenated the feature vectors of all feature types for each user. Each of the feature
types is described in detail below.</p>
      <p>
        Term frequency-inverse document frequency (words). TF-IDF is a
statistical measure which weighs the importance of words in a corpus and gives
higher values to words that occur often in few documents. For a complete
description of TF-IDF, see [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. For this task we trained a TF-IDF model on all
our training data. For each user, we concatenated all
their tweets into one string and calculated the TF-IDF values for every training
and test user. The TF-IDF model used word n-grams of length 1 to 3 and, for
time efficiency, a maximum of 2000 features. Since it seems unlikely
that a term occurring many times in a document is proportionally more important
than a term occurring once, sublinear term
frequency scaling was applied, as well as smooth inverse document frequency (meaning
that one is added to every document frequency, which prevents division by zero). The TF-IDF features
were computed with Python's scikit-learn [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. With the TF-IDF features on words,
we hope to capture that bots, females and males tend to discuss different types
of topics, and that bots might have a more compressed feature vector, i.e. use
a few terms more often and have a smaller variety of words compared
to humans.
      </p>
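      <p>As a point of reference, the weighting described above can be written out as follows. This is a sketch that assumes scikit-learn's documented definitions of sublinear term frequency and smoothed inverse document frequency. The symbols are introduced here for illustration only: tf(t, d) is the raw count of term t in the concatenated tweets d of one user, n is the number of users in the training data, and df(t) is the number of users whose tweets contain t.</p>
      <preformat><![CDATA[
```latex
\mathrm{tf}_{\mathrm{sub}}(t, d) = 1 + \ln \mathrm{tf}(t, d) \quad \text{for } \mathrm{tf}(t, d) > 0,
\qquad
\mathrm{idf}(t) = \ln \frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, d) = \mathrm{tf}_{\mathrm{sub}}(t, d) \cdot \mathrm{idf}(t)
```
]]></preformat>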
    </sec>
    <sec id="sec-4">
      <title>Term frequency-inverse document frequency (characters)</title>
      <p>For the TF-IDF features weighted on characters, the approach is the
same as described above, but instead of calculating the importance of words, the
model calculates the importance of character combinations. We included
character n-grams of length 1 to 4. Since we did not want to include uncommon
character n-grams, we set a minimum document frequency of 20 percent, with a
maximum of 2000 features. By using TF-IDF on characters as features, we mainly
hope to catch the different uses of blank space and of different symbols in
conjunction with letters and digits.</p>
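      <p>A minimal sketch of the word and character TF-IDF configurations, assuming scikit-learn's TfidfVectorizer. The variable and function names are illustrative, and it is an assumption that the character model reuses the sublinear and smoothing settings of the word model.</p>
      <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF: 1- to 3-grams, at most 2000 features,
# sublinear term frequency and smoothed inverse document frequency.
word_tfidf = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 3),
    max_features=2000,
    sublinear_tf=True,
    smooth_idf=True,
)

# Character-level TF-IDF: 1- to 4-grams that occur in at least 20 percent
# of the user documents, again capped at 2000 features.
# Assumption: same sublinear/smoothing settings as the word model.
char_tfidf = TfidfVectorizer(
    analyzer="char",
    ngram_range=(1, 4),
    min_df=0.2,
    max_features=2000,
    sublinear_tf=True,
    smooth_idf=True,
)


def tfidf_features(train_docs, test_docs):
    """Fit both vectorizers on the training users, then transform all users.

    Each element of train_docs / test_docs is assumed to be one user's
    tweets concatenated into a single string.
    """
    train_word = word_tfidf.fit_transform(train_docs)
    train_char = char_tfidf.fit_transform(train_docs)
    test_word = word_tfidf.transform(test_docs)
    test_char = char_tfidf.transform(test_docs)
    return (train_word, train_char), (test_word, test_char)
```
]]></preformat>
      <p>Fitting only on the training users and reusing the fitted vocabulary for the test users keeps the feature columns aligned between the two sets.</p>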
      <p>Compression features. Compression features were computed by compressing the
concatenated tweets of each user and calculating different statistics of the
compressed tweets. The reason for including compression features was
the assumption that bots might communicate in a more monotonous and
repetitive way than humans. Spambots in particular are likely to
post the same tweet over and over again, perhaps without changing the content
at all. We wanted the compression features to catch that kind of behavior by
detecting a difference in compression ratios between human and bot accounts.</p>
      <p>
        We used Python's zipfile module to compress every user's concatenated
tweets with the three different compression methods Deflated, BZIP2, and LZMA,
which are all included in the module. To obtain the compression feature vector
for a user, we concatenated the following entities, giving us 19 features:
        <list list-type="bullet">
          <list-item><p>Original size (size of all concatenated tweets of a user before compression)</p></list-item>
          <list-item><p>Compression size for every compression method</p></list-item>
          <list-item><p>Mean, median, population standard deviation, standard deviation, max value and min value for the compression sizes</p></list-item>
          <list-item><p>Normalized compression (each compression size divided by original size)</p></list-item>
          <list-item><p>Mean, median, population standard deviation, standard deviation, max value and min value for the normalized compressions</p></list-item>
        </list>
      </p>
      <p>
        Tweet features. The tweet features consist of a variety of features connected to
the attributes of a user's way of tweeting and the content of the tweets. We have
previously done some classification regarding bot detection on tweets in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but in
this task we have no time stamps for the tweets or metadata about the users, and
therefore some of the features differ from our previous method. We have also
included some additional features such as part-of-speech tags and pronouns.
      </p>
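      <p>The compression features described above can be sketched as follows, using only Python's standard library. The function names and the archive member name tweets.txt are hypothetical. It is also an assumption that the compressed size is measured on the archive member rather than on the whole archive file, and that the input text is non-empty.</p>
      <preformat><![CDATA[
```python
import io
import statistics
import zipfile

# The three compression methods named above, as exposed by zipfile.
METHODS = (zipfile.ZIP_DEFLATED, zipfile.ZIP_BZIP2, zipfile.ZIP_LZMA)


def summary(values):
    """Mean, median, population std, sample std, max and min of a sequence."""
    return [
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
        statistics.stdev(values),
        max(values),
        min(values),
    ]


def compression_features(text):
    """Return the 19 compression features for one user's concatenated tweets."""
    data = text.encode("utf-8")
    original_size = len(data)
    sizes = []
    for method in METHODS:
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", compression=method) as archive:
            archive.writestr("tweets.txt", data)
            # Assumption: measure the compressed size of the archive member,
            # not the size of the whole archive including zip headers.
            sizes.append(archive.getinfo("tweets.txt").compress_size)
    ratios = [size / original_size for size in sizes]
    # 1 original size + 3 sizes + 6 statistics + 3 ratios + 6 statistics = 19
    return [original_size] + sizes + summary(sizes) + ratios + summary(ratios)
```
]]></preformat>
      <p>A user who repeats the same tweet many times yields much smaller normalized compression values than a user whose tweets vary, which is the signal these features are meant to expose.</p>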
      <p>Several of the attributes calculated for a user consist of vectors, and these
vectors have been represented as features by calculating statistics of the vector.
The statistics calculated for the vectors are always mean, median, population
standard deviation, standard deviation, maximum value and minimum value.
All tweet features are listed below:
        <list list-type="bullet">
          <list-item><p>Retweet ratio (number of tweets that are retweets divided by number of posted tweets)</p></list-item>
          <list-item><p>The character length of all tweets concatenated</p></list-item>
          <list-item><p>Shannon entropy [14] of all tweets concatenated</p></list-item>
          <list-item><p>Number of unique words for all tweets</p></list-item>
          <list-item><p>Number of tweets that have been truncated during the crawling process. They always end with a character showing three dots (...).</p></list-item>
          <list-item><p>Number of different characters the tweets start with</p></list-item>
          <list-item><p>Number of unique starting characters (including only letters and numbers)</p></list-item>
          <list-item><p>Whether or not the user always starts the tweet by mentioning another user</p></list-item>
          <list-item><p>Number of different characters the tweets end with</p></list-item>
          <list-item><p>Number of tweets mentioning the word bot</p></list-item>
          <list-item><p>Number of unique hashtags used divided by the total number of used hashtags</p></list-item>
          <list-item><p>Number of unique hashtags used divided by the total number of tweets</p></list-item>
          <list-item><p>Number of unique users mentioned divided by the total number of mentioned users</p></list-item>
          <list-item><p>Number of unique users mentioned divided by the total number of tweets</p></list-item>
          <list-item><p>Number of unique tweets published divided by the total number of tweets</p></list-item>
          <list-item><p>Number of unique 30-character beginnings of tweets</p></list-item>
          <list-item><p>Number of unique 8-character beginnings of tweets</p></list-item>
          <list-item><p>Number of tweets that contain no hashtags, mentions or URLs and are not retweets, divided by the total number of tweets</p></list-item>
          <list-item><p>Number of unique emojis used</p></list-item>
          <list-item><p>Number of unique emojis used divided by the total number of emojis used</p></list-item>
          <list-item><p>Number of unique characters to end tweets with</p></list-item>
          <list-item><p>Number of unique URLs in tweets</p></list-item>
          <list-item><p>Number of unique URLs in tweets divided by the total number of URLs in tweets</p></list-item>
          <list-item><p>Number of unique domains linked to</p></list-item>
          <list-item><p>Number of unique domains linked to divided by the total number of linked domains</p></list-item>
          <list-item><p>Statistics of number of URLs per tweet</p></list-item>
          <list-item><p>Statistics of length of tweets</p></list-item>
          <list-item><p>Statistics of number of mentions per tweet</p></list-item>
          <list-item><p>Statistics of Shannon entropy per tweet</p></list-item>
          <list-item><p>Statistics of number of hashtags per tweet</p></list-item>
          <list-item><p>Statistics of number of words per tweet</p></list-item>
          <list-item><p>Statistics of number of pronouns per tweet</p></list-item>
          <list-item><p>Statistics of number of upper case letters per tweet</p></list-item>
          <list-item><p>Statistics of number of lower case letters per tweet</p></list-item>
          <list-item><p>Statistics of number of blank spaces per tweet</p></list-item>
          <list-item><p>Statistics of number of digits per tweet</p></list-item>
          <list-item><p>Statistics of number of line breaks per tweet</p></list-item>
          <list-item><p>Statistics of number of tweets between two tweets including a hashtag</p></list-item>
          <list-item><p>Statistics of number of tweets between two tweets including a URL</p></list-item>
          <list-item><p>Statistics of number of tweets between two tweets including a mention</p></list-item>
          <list-item><p>Statistics of number of tweets between two tweets including a question mark</p></list-item>
          <list-item><p>Statistics of number of tweets between two tweets being retweets</p></list-item>
          <list-item><p>Statistics of Levenshtein distance between consecutive tweets. Read more about the Levenshtein distance in [<xref ref-type="bibr" rid="ref1">1</xref>].</p></list-item>
          <list-item><p>Statistics of the part-of-speech (POS) vector, where every element in the vector corresponds to the number of occurrences of a specific POS tag. POS tagging is done with the Natural Language Toolkit [<xref ref-type="bibr" rid="ref8">8</xref>].</p></list-item>
        </list>
      </p>
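      <p>Three building blocks recur in the list above: Shannon entropy, the six summary statistics of a vector, and the number of tweets between two tweets with a given property. A minimal sketch follows. The function names are illustrative, and computing the entropy over characters rather than words is an assumption.</p>
      <preformat><![CDATA[
```python
import math
import statistics
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution of text, in bits."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def vector_statistics(values):
    """Mean, median, population std, sample std, max and min of a vector."""
    # The sample standard deviation is undefined for a single value.
    sample_std = statistics.stdev(values) if len(values) > 1 else 0.0
    return [
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
        sample_std,
        max(values),
        min(values),
    ]


def gaps_between(flags):
    """Number of tweets lying between consecutive tweets whose flag is True."""
    positions = [i for i, flag in enumerate(flags) if flag]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]
```
]]></preformat>
      <p>For example, the statistics of Shannon entropy per tweet would be vector_statistics([shannon_entropy(t) for t in tweets]), and the statistics of the number of tweets between two tweets including a hashtag would be vector_statistics(gaps_between(["#" in t for t in tweets])), provided at least two tweets contain a hashtag.</p>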
      <p>Some features' denominators are increased by 1 to prevent division by zero. The
feature vector for the tweet features consists of 139 features.</p>
      <p>With the tweet features, we hope to distinguish bots and humans from
each other in several ways. We went through the labeled data manually and could
for example see that there were often accounts which always started their tweets
with a mention, or always retweeted someone. In our labeled data we also saw
that women used emojis more frequently, which motivated us to implement
the features regarding emojis. Under the hypothesis that a bot wants to contact
and be seen by as many users as possible (for commercial purposes, for example),
the features concerning the use of mentions and hashtags are important. If
an account is used for generating traffic to a website (which could be likely for a
spambot), the number of different URLs posted would be reasonably small, but
they should occur in several tweets. The Levenshtein feature, the statistics of the number
of tweets between two tweets including hashtags, URLs etc., and the entropy
features are all used for detecting how much the content changes between tweets. It
seems reasonable that a bot would not vary the content of its tweets
as much as a human.</p>
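      <p>The Levenshtein feature mentioned above compares each tweet with the one that follows it. A minimal sketch in plain Python, using the standard dynamic-programming formulation; the function names are illustrative.</p>
      <preformat><![CDATA[
```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def consecutive_distances(tweets):
    """Levenshtein distance between each tweet and the one that follows it."""
    return [levenshtein(x, y) for x, y in zip(tweets, tweets[1:])]
```
]]></preformat>
      <p>An account that posts nearly identical tweets produces distances close to zero, so the mean and maximum of this vector stay small.</p>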
    </sec>
    <sec id="sec-5">
      <title>Classification algorithm</title>
      <p>
        Initially we used the Random forest algorithm for classification, but we later
discovered that CatBoost gave us better performance. The CatBoost algorithm
is based on gradient boosting over decision trees. The CatBoost classification
algorithm is further described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>For the CatBoost classifier, we used our training data for training the model,
and to prevent the model from overfitting, we used the test set as validation
data. Since the evaluation metric for the PAN Bot and Gender task would be
accuracy, we chose accuracy as the metric for selecting the best final model after
a total of 5000 iterations.</p>
      <sec id="sec-5-1">
        <title>Experiment and results</title>
        <p>
          We parsed every tweet for all training and test users. Our TF-IDF models
were trained on the tweets from our training users, and we then calculated the
TF-IDF, compression and tweet features for all of our users. This gave us a
total of 4158 features for each user (2000 word TF-IDF, 2000 character TF-IDF,
19 compression and 139 tweet features). We let the CatBoost
model learn for 5000 iterations, training on our training users and validating on
our test users. We then saved the model giving the best accuracy on the
validation set. This was done for English and Spanish separately and resulted in
two different models. These two models were then applied to the dataset provided
in the TIRA [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] environment, and the results are shown in table 1. The Low
Dimensionality Statistical Embedding (LDSE) baseline described in [12] is also
included in the table.
        </p>
        <p>It is clear that our model performs better on the English data than
on the data in Spanish. There might be several reasons for this. Since the TF-IDF
features are the only language-dependent features, the signals they capture
for gender profiling in English might be harder to catch in Spanish. It
might also be the case that the data set in Spanish is more complex, making
the problem harder to solve. The bot or human classification seems to be an
easier task for our models than the gender classification, for both languages.
It is also clear that our model performs better than the LDSE baseline on all of the
different tasks.</p>
        <p>12. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation
for language variety identification. In: International Conference on Intelligent
Text Processing and Computational Linguistics. pp. 156-169. Springer (2016)</p>
        <p>13. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019:
Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Muller, H.
(eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep
2019)</p>
        <p>14. Shannon, C.E.: A mathematical theory of communication. Bell System Technical
Journal 27(3), 379-423 (1948)</p>
        <p>15. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online
human-bot interactions: Detection, estimation, and characterization. CoRR
abs/1703.03107 (2017), http://arxiv.org/abs/1703.03107</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Black</surname>
            ,
            <given-names>P.E.</given-names>
          </string-name>
          :
          <article-title>Dictionary of algorithms and data structures</article-title>
          .
          <source>National Institute of Standards and Technology Gaithersburg</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjavancas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zangerle</surname>
          </string-name>
          , E.: Overview of PAN 2019:
          <article-title>Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In: Crestani,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Braschler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Savoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Rauber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            , Muller, H.,
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Heinatz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Ferro</surname>
          </string-name>
          , N. (eds.)
          <source>Proceedings of the Tenth International Conference of the CLEF Association (CLEF</source>
          <year>2019</year>
          ). Springer (Sep
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dorogush</surname>
            ,
            <given-names>A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ershov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Catboost: gradient boosting with categorical features support</article-title>
          . arXiv preprint arXiv:1810.11363
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fernquist</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaati</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Online monitoring of large events</article-title>
          .
          <source>In: 2019 IEEE International Conference on Intelligence and Security Informatics (ISI)</source>
          .
          <source>IEEE</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fernquist</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaati</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schroeder</surname>
          </string-name>
          , R.:
          <article-title>Political bots and the swedish general election</article-title>
          .
          <source>In: 2018 IEEE International Conference on Intelligence and Security Informatics (ISI)</source>
          . pp.
          <fpage>124</fpage>
          -
          <lpage>129</lpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gilani</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farahbakhsh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tyson</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crowcroft</surname>
          </string-name>
          , J.:
          <article-title>Of bots and humans (on twitter)</article-title>
          .
          <source>In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining</source>
          <year>2017</year>
          . pp.
          <fpage>349</fpage>
          -
          <lpage>354</lpage>
          . ASONAM '17,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2017</year>
          ). https://doi.org/10.1145/3110025.3110090, http://doi.acm.org/10.1145/3110025.3110090
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Isbister</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaati</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Gender classification with data independent features in multiple languages</article-title>
          .
          <source>In: 2017 European Intelligence and Security Informatics Conference (EISIC)</source>
          . pp.
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Nltk: The natural language toolkit</article-title>
          .
          <source>In: Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics</source>
          . Philadelphia: Association for Computational Linguistics (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duchesnay</surname>
          </string-name>
          , E.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>TIRA Integrated Research Architecture</article-title>
          . In: Ferro,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <surname>C</surname>
          </string-name>
          . (eds.)
          <article-title>Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of</article-title>
          CLEF. Springer (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ullman</surname>
            ,
            <given-names>J.D.</given-names>
          </string-name>
          :
          <article-title>Mining of massive datasets</article-title>
          . Cambridge University Press (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>