<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Machine Learning Algorithms for Author Profiling In Social Media</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Dichiu</string-name>
          <email>ddichiu@bitdefender.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irina Rancea</string-name>
          <email>irancea@bitdefender.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <country>Bitdefender Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present our approach to solving the PAN 2016 Author Profiling Task, which involves classifying users' gender and age from their social media posts. We used SVM classifiers and neural networks on TF-IDF and verbosity features. Results showed that SVM classifiers perform better on the English datasets, while neural networks perform better on the Dutch and Spanish datasets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Due to the huge amount of text available on the Internet, both academia and
industry have developed an interest in author profiling. It consists of discovering as
much as possible about an unknown author by analyzing the data they post
online. This year, the PAN Author Profiling task focuses on gender and age
classification. The training documents consist of tweets, while the evaluation is
performed on blogs or other social media documents, excluding tweets. Similar
work on classifying age and gender from short texts obtained from tweets is
described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], using the TIRA platform ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Training documents are
provided for three languages: English, Spanish and Dutch.
      </p>
      <p>
        Our approach to the classification tasks uses the scikit-learn LinearSVC
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a neural network based on the nolearn Lasagne module [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as distinct predictors.
For feature extraction we used vectorizers from the scikit-learn module for
Python.
      </p>
      <p>
        For features we tried a tf-idf matrix at both character and word level, with various
n-gram ranges and fine-tuning of the remaining parameters depending on the language
and subtask [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We computed the tf-idf matrix using TfidfVectorizer from the scikit-learn
Python module. Before vectorizing the data, we concatenated all tweets for each user.
      </p>
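      <p>The preprocessing and vectorization steps described above can be sketched as follows. This is a minimal illustration, not our exact code: the toy tweets, the helper name concatenate_by_user, and the n-gram ranges are assumptions chosen for the example.</p>
      <preformat>
```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data: (user_id, tweet) pairs.
tweets = [
    ("u1", "Loving the new phone"),
    ("u1", "Coffee first, then code"),
    ("u2", "Match day with the kids"),
    ("u2", "Coffee and football tonight"),
]


def concatenate_by_user(pairs):
    """Join all tweets of each user into one document (first-seen user order)."""
    grouped = defaultdict(list)
    for user_id, text in pairs:
        grouped[user_id].append(text)
    return list(grouped.keys()), [" ".join(texts) for texts in grouped.values()]


users, documents = concatenate_by_user(tweets)

# One tf-idf matrix per level; the n-gram ranges here are illustrative only.
word_vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 1))
char_vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))

word_matrix = word_vectorizer.fit_transform(documents)  # one row per user
char_matrix = char_vectorizer.fit_transform(documents)  # one row per user
```
      </preformat>
      <p>Each row of either matrix then represents one user rather than one tweet, which is the unit the classifiers are trained on.</p>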
      <p>
        The authors in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] obtained good results in the PAN 2015 Author Profiling competition
with SVM classifiers on tf-idf matrices at character level. However, in that
competition the training and testing datasets came from the same type of social media,
while the training and testing datasets of the PAN 2016 Author Profiling competition
come from different types of social media (e.g. Twitter for the training dataset and
blogs for the testing dataset). Taking this into consideration, we expected a tf-idf
matrix at word level to generalize better across genres, and so we trained models
based on both types of tf-idf matrices.
      </p>
      <p>
        We combined, in a scikit-learn FeatureUnion structure, the tf-idf scores with a
verbosity rate computed as a type/token ratio, as was done in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
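      <p>A minimal sketch of how tf-idf scores and a verbosity rate could be combined in a FeatureUnion is given below. The transformer class VerbosityRate is a hypothetical name, and the lowercased whitespace tokenization used for the type/token ratio is an assumption made for the example.</p>
      <preformat>
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class VerbosityRate(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one type/token ratio per document."""

    def fit(self, documents, y=None):
        return self  # nothing to learn from the training data

    def transform(self, documents):
        ratios = []
        for text in documents:
            tokens = text.lower().split()  # assumed tokenization
            ratios.append(len(set(tokens)) / len(tokens) if tokens else 0.0)
        return np.array(ratios).reshape(-1, 1)


# tf-idf columns come first, the verbosity rate is appended as the last column.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer(analyzer="word")),
    ("verbosity", VerbosityRate()),
])

documents = ["the cat saw the dog", "one two three"]
matrix = features.fit_transform(documents)
```
      </preformat>
      <p>In this toy example the first document has 4 distinct words out of 5 tokens, giving a verbosity rate of 0.8, while the second has no repeated words and a rate of 1.0.</p>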
      <p>There were three types of classifiers:</p>
      <list list-type="order">
        <list-item>
          <p>Support Vector Machine (SVM1 hereinafter), based on verbosity and features extracted with tf-idf at character level;</p>
        </list-item>
        <list-item>
          <p>Support Vector Machine (SVM2 hereinafter), based on verbosity and features extracted with tf-idf at word level;</p>
        </list-item>
        <list-item>
          <p>Neural Network (NN hereinafter), based on features extracted with tf-idf at word level.</p>
        </list-item>
      </list>
      <p>
        To find good parameters that do not overfit, we used scikit-learn’s StratifiedKFold
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for the cross-validation phase of the SVMs.
      </p>
      <p>For SVM1 the LinearSVC parameters common to all test runs were: dual =
False, loss = squared_hinge, penalty = l2. Table 1, on page 2, and Table 2, on page 3,
summarize the parameters we found to be optimal for SVM1. Parameters that are
missing from the tables have their default values.</p>
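      <p>The cross-validation setup with the common SVM1 settings can be sketched as below. This is an illustration under stated assumptions: the toy corpus, the n-gram range, the number of folds, and the use of cross_val_score are choices made for the example, and the import path assumes a recent scikit-learn release in which StratifiedKFold lives in sklearn.model_selection.</p>
      <preformat>
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: one concatenated document per user, binary labels.
documents = [
    "football match tonight", "great football goal",
    "football with friends", "watching the football final",
    "new recipe for dinner", "baking bread recipe",
    "recipe with fresh herbs", "favourite soup recipe",
]
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Character-level tf-idf followed by LinearSVC with the common SVM1 settings.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # illustrative range
    LinearSVC(dual=False, loss="squared_hinge", penalty="l2"),
)

# Stratified folds keep the class proportions equal in every split.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(model, documents, labels, cv=folds, scoring="accuracy")
print("mean cross-validation accuracy:", scores.mean())
```
      </preformat>
      <p>Because the vectorizer sits inside the pipeline, it is refitted on the training part of each fold, so no vocabulary or idf statistics leak from the held-out part.</p>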
      <p>For SVM2 the LinearSVC algorithm was used with default parameters. Table 3 on
page 3 summarizes the parameters we found to be optimal for this classifier. Parameters
that are missing from the table have their default values.</p>
      <p>NN is a neural network classifier with 2 hidden layers, each having
50 nodes. The input features were based on a tf-idf matrix at word level, reduced to
a 50-dimensional feature space using scikit-learn’s TruncatedSVD. Table 4 on page 3
summarizes the parameters used with this neural network.</p>
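      <p>The dimensionality reduction step that precedes the neural network can be sketched as follows. The synthetic corpus and the random seeds are assumptions made so that the example is self-contained; only the word-level tf-idf and the 50 output dimensions come from the description above.</p>
      <preformat>
```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical synthetic corpus: 120 documents over a 200-word vocabulary.
rng = random.Random(0)
vocabulary = ["word{:03d}".format(i) for i in range(200)]
documents = [" ".join(rng.choices(vocabulary, k=40)) for _ in range(120)]

# Sparse word-level tf-idf matrix, one row per document.
tfidf = TfidfVectorizer(analyzer="word").fit_transform(documents)

# TruncatedSVD works directly on sparse input and returns a dense matrix.
svd = TruncatedSVD(n_components=50, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```
      </preformat>
      <p>The 50 columns of the reduced matrix would then serve as the input layer of the network.</p>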
      <p>
        To reduce the impact of overfitting, we used a dropout layer [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] (with a dropout
probability of 50%) between the hidden layers. We also made use of early stopping
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and the maximum number of epochs for each classifier is reported in Tables 5, 6,
and 7, on page 4.
      </p>
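      <p>The two regularization techniques mentioned above can be illustrated with a short NumPy sketch. It is not the nolearn/Lasagne code we ran: the function names, the rescaling of surviving units (inverted dropout), the patience-based stopping criterion, and the loss values are all assumptions made for the example.</p>
      <preformat>
```python
import numpy as np


def dropout(activations, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)


def early_stopping_epoch(valid_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; otherwise it runs to the last epoch.
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if best > loss:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(valid_losses)


rng = np.random.default_rng(0)
hidden = np.ones((4, 50))            # batch of 4 examples, 50 hidden units
dropped = dropout(hidden, 0.5, rng)  # each entry is now either 0.0 or 2.0

# Hypothetical validation losses: the best value is reached at epoch 3.
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stop = early_stopping_epoch(losses, patience=3)  # stops at epoch 6
```
      </preformat>
      <p>With a patience of 3, training halts at epoch 6 and never sees the later improvement at epoch 7, which shows why the patience value has to be chosen with care.</p>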
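      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Subtask</th>
              <th>Algorithm</th>
              <th>Parameter Name</th>
              <th>Parameter Value</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Gender</td>
              <td>TfidfVectorizer</td>
              <td>max_df</td>
              <td>0.7</td>
            </tr>
            <tr>
              <td>Gender</td>
              <td>TfidfVectorizer</td>
              <td>ngram_range</td>
              <td>1,1</td>
            </tr>
            <tr>
              <td>Gender</td>
              <td>LinearSVC</td>
              <td>all</td>
              <td>defaults</td>
            </tr>
            <tr>
              <td>Age</td>
              <td>TfidfVectorizer</td>
              <td>max_df</td>
              <td>0.7</td>
            </tr>
            <tr>
              <td>Age</td>
              <td>TfidfVectorizer</td>
              <td>ngram_range</td>
              <td>1,1</td>
            </tr>
            <tr>
              <td>Age</td>
              <td>LinearSVC</td>
              <td>all</td>
              <td>defaults</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Parameter Name</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>layers</td>
            </tr>
            <tr>
              <td>layer_1_num_units</td>
            </tr>
            <tr>
              <td>layer_1_dropout</td>
            </tr>
            <tr>
              <td>layer_2_num_units</td>
            </tr>
            <tr>
              <td>output_nonlinearity</td>
            </tr>
            <tr>
              <td>update</td>
            </tr>
            <tr>
              <td>update_learning_rate</td>
            </tr>
            <tr>
              <td>update_momentum</td>
            </tr>
            <tr>
              <td>eval_size</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>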
    </sec>
    <sec id="sec-2">
      <title>3 Results</title>
      <p>For Dutch, the best results were obtained using a tf-idf matrix at word level, reduced to a
50-dimensional space and then classified with a neural network that was trained for
1600 epochs.</p>
    </sec>
    <sec id="sec-3">
      <title>4 Conclusions</title>
      <p>All the classifiers suffered from overfitting. During the cross-validation phase of
our training, we registered accuracies around 0.8, far higher than the accuracy scores on
the test datasets. However, the types of features and models we used on English and
Spanish generalize better from the training dataset to testing dataset 2, while accuracies
on testing dataset 1 are, on average, about 10 percentage points lower. This could
mean that, at the feature level we chose, the training dataset and testing dataset 2
are more similar than the training dataset and testing dataset 1. Based on our results, we
can say that word-level features generalize better when used with a linear
SVM. Also, neural networks, when trained carefully, can outperform SVMs using the
same feature set.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verhoeven</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Evaluations Concerning Cross-genre Author Profiling</article-title>
          .
          In:
          <source>Working Notes Papers of the CLEF 2016 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2016</year>
          ).
          <string-name>Shlomo Argamon</string-name>
          and
          <string-name>Anat Rachel Shimoni</string-name>
          .
          <article-title>Automatically categorizing written texts by author gender</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>17</volume>
          :
          <fpage>401</fpage>
          -
          <lpage>412</lpage>
          ,
          <year>2003</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In:
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burrows</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoppe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments</article-title>
          . In:
          <string-name>
            <surname>Tjoa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liddle</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schewe</surname>
            ,
            <given-names>K.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (eds.)
          <source>9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA</source>
          . pp.
          <fpage>151</fpage>
          -
          <lpage>155</lpage>
          . IEEE, Los Alamitos, California (Sep
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Scikit-learn: Machine Learning in Python, Pedregosa et al.,
          <source>JMLR 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>Sander</given-names>
          </string-name>
          , et al.
          <source>Lasagne: First Release</source>
          . Zenodo: Geneva, Switzerland
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>John</given-names>
            <surname>Houvardas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Efstathios</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <article-title>N-gram feature selection for authorship identification</article-title>
          .
          In
          <source>Artificial Intelligence: Methodology, Systems and Applications</source>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          , Springer,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Octavia-Maria</given-names>
            <surname>Sulea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Dichiu</surname>
          </string-name>
          .
          <article-title>Automatic Profiling of Twitter users based on their tweets - Notebook for PAN at CLEF 2015</article-title>
          . In Linda Cappellato, Nicola Ferro, Gareth Jones, and Eric San Juan, editors,
          <source>CLEF 2015 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 8-11 September
          <year>2015</year>
          . CEUR-WS.org. ISSN 1613-0073
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
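          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>Nitish</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>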
          .
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>15</volume>
          , no.
          <issue>1</issue>
          (
          <year>2014</year>
          ):
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Yoshua</given-names>
          </string-name>
          .
          <article-title>Practical recommendations for gradient-based training of deep architectures</article-title>
          . In
          <source>Neural Networks: Tricks of the Trade</source>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>478</lpage>
          . Springer Berlin Heidelberg,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>