Exploring Word Embeddings and Character N-Grams for Author Clustering

Notebook for PAN at CLEF 2016

Yunita Sari and Mark Stevenson
Department of Computer Science, University of Sheffield
Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom
E-mail: {y.sari, mark.stevenson}@sheffield.ac.uk

Abstract. We present our system for the PAN 2016 Author Clustering task. Our software uses simple character n-grams to represent the document collection and applies K-Means clustering optimized using the Silhouette Coefficient. Our system yields competitive results and requires only a short runtime; character n-grams can capture a wide range of information, making them effective for authorship attribution. We also present a comparison study of two different features: character n-grams and word embeddings.

1 Author Clustering

This report describes our system that participated in the PAN 2016 Author Clustering task [18]. The task is to create clusters from a document collection, where each cluster represents a different author. The task consists of two scenarios. The first, complete author clustering, is to create k different clusters representing k authors and to assign each document to exactly one of those clusters. The second, authorship-link ranking, is reminiscent of information retrieval: given a group of documents in the same cluster, we have to provide confidence scores for pairs of documents, indicating the likelihood that the document pair was written by the same author. This year's task (http://pan.webis.de/clef16/pan16-web/author-identification.html) consists of 18 problems in 3 languages (English, Dutch and Greek) and 2 genres (newspaper articles and reviews). For each problem the language and genre are uniform, but topics may differ. The lengths of the documents vary from a few hundred to a few thousand words.

Evaluation The author clustering task is evaluated on both scenarios, so two different outputs need to be produced. The BCubed F-Score [1] is used to estimate the quality of the clustering. In this case, precision and recall are computed for each item: the precision of an item corresponds to how many items in the same cluster belong to its category, while its recall is calculated by counting the items from its category that appear in its cluster. To evaluate the authorship-link ranking, Mean Average Precision (MAP) [9] is used. This metric is commonly used in information retrieval tasks; a high MAP score is obtained when the system retrieves the most relevant documents for the queries.
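To make the clustering metric concrete, the following is a minimal sketch of the item-level BCubed computation described above, assuming hard (single-label) clusters; the function name and input format are our own illustrative choices, not part of the official evaluation code.

    def bcubed_f(labels, clusters):
        """BCubed precision, recall and F-score for a hard clustering.

        labels   -- true author of each document
        clusters -- cluster assigned to each document by the system
        """
        n = len(labels)
        precision = recall = 0.0
        for i in range(n):
            same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
            same_author = [j for j in range(n) if labels[j] == labels[i]]
            correct = sum(1 for j in same_cluster if labels[j] == labels[i])
            precision += correct / len(same_cluster)  # item precision
            recall += correct / len(same_author)      # item recall
        p, r = precision / n, recall / n
        return p, r, 2 * p * r / (p + r)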
To achieve good performance in both scenarios, we need an optimized clustering algorithm and, at the same time, features that can effectively discriminate each author's writing style. We therefore divide our exploration into two parts. First, we investigate two different features, character n-grams and word embeddings, both of which are known for their success in text classification tasks. As the two features work in different ways, we are particularly interested in how they characterize each author's writing. Second, we perform hyperparameter tuning on the clustering algorithm in order to find a model with the optimal number of clusters. Our system is described in the next section. Results are reported in Section 3 and conclusions drawn in Section 4.

2 System Description

During system development, two different features were used: word embeddings and character n-grams. Our main goal is to investigate whether word embeddings can perform well on a multi-topic author attribution task. The semantic information in word embeddings has been shown to effectively capture similarities between documents [8,19,7], and we expect similar performance on authorship attribution. We also developed another system using character n-grams; previous work [3,5,6] found that character n-grams are a highly effective feature for authorship attribution. K-Means was used to define clusters in the document collection, and we optimized the hyperparameter k by calculating the Silhouette Coefficient [15] for each candidate model. Our system is developed in Python. We used the Scikit-learn library (http://scikit-learn.org/) [12] for the TfidfVectorizer applied to character n-grams, for K-Means and for the Silhouette Coefficient; word embeddings were trained using Gensim word2vec (https://radimrehurek.com/gensim/) [14]. We compared the performance of each feature on the training dataset and only submitted the system with the best result.

2.1 Document Representation

Word Embeddings Semantic information has rarely been used in authorship attribution. This is mainly due to the unavailability of NLP tools that can perform semantic analysis with relatively high accuracy [17]. In addition, writing style can usually be characterized using more common stylometric features such as character, lexical, and syntactic information. Recently, neural network based methods have enjoyed a resurgence in popularity, including word embeddings [2,10]. Word embeddings represent a word as a low-dimensional vector based on its contexts, so the vector representation might capture not only grammatical and syntactic information but also semantic features of the word [19,7].

Training word embeddings Word embeddings were implemented for English and Dutch (we were unable to implement word embeddings for Greek due to problems caused by text encoding). For English, we used the pre-trained Google word2vec vectors, which have a dimensionality of 300 and were trained on 100 billion words from Google News [11]. For Dutch, we used the word2vec implementation in Gensim to train word embeddings on 3.7Gb of Dutch texts, setting the dimensionality of the vectors to 300 and the window size to 5. By default, Gensim word2vec uses the continuous bag-of-words architecture during training. To overcome the problem of variable-length documents, we simply used the average of the word vectors in a given document.
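A minimal sketch of this averaging step follows; the vector file name, whitespace tokenization and lower-casing are illustrative assumptions rather than reported details of our pipeline.

    import numpy as np
    from gensim.models import KeyedVectors

    # Pre-trained Google News vectors (file name is an assumption).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    def document_vector(text, vectors, dim=300):
        """Represent a document as the mean of its in-vocabulary word vectors."""
        tokens = [t for t in text.lower().split() if t in vectors]
        if not tokens:
            return np.zeros(dim)  # fallback for documents with no known words
        return np.mean([vectors[t] for t in tokens], axis=0)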
Character n-grams Sapkota et al. [16] provide evidence of how useful character n-grams can be for capturing the characteristics of an author's writing. We extracted the 5000 most frequent character n-grams, with n ranging from 3 to 8, and calculated their tf-idf scores using the TfidfVectorizer from the Scikit-learn library. We did not apply any pre-processing steps, meaning that all function words are included.
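A minimal sketch of this feature extraction with Scikit-learn is shown below; any setting other than the n-gram range and vocabulary size (for example, the vectorizer's default lower-casing) is an assumption.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # tf-idf over the 5000 most frequent character n-grams, with n = 3..8.
    vectorizer = TfidfVectorizer(analyzer="char",
                                 ngram_range=(3, 8),
                                 max_features=5000)
    X = vectorizer.fit_transform(documents)  # documents: list of raw text strings

With analyzer="char", whitespace is kept inside the n-grams, so function words and spacing habits contribute to the representation, which matches the decision to skip pre-processing.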
2.2 Clustering Algorithm

One of the most important steps in author clustering is to determine the correct number of clusters, since this corresponds to the number of authors. This is especially challenging in this task, since a large proportion of the clusters contain only a single document. We chose K-Means clustering and used the implementation from the Scikit-learn machine learning library. To optimize the number of clusters, we performed hyperparameter tuning using the Silhouette Coefficient.

Silhouette Coefficient The Silhouette Coefficient (http://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient) works by evaluating the clustering model for different values of k. A score is produced for each sample, and higher scores correspond to a model with better-defined clusters. Equation 1 describes how to calculate the Silhouette Coefficient s for a single sample:

    s = (b - a) / max(a, b)    (1)

where a is the mean distance between the sample and all other points in the same cluster, and b is the mean distance between the sample and all other points in the next nearest cluster. We calculated the Silhouette Coefficient for a range of values of k and picked the value with the highest score.
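A sketch of this tuning loop with Scikit-learn follows; the candidate range of k and the fixed random seed are illustrative assumptions (k must stay below the number of documents).

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def best_k(X, k_values):
        """Fit K-Means for each candidate k and keep the model whose
        mean Silhouette Coefficient is highest."""
        best_score, best_model = -1.0, None
        for k in k_values:
            model = KMeans(n_clusters=k, random_state=0).fit(X)
            score = silhouette_score(X, model.labels_)
            if score > best_score:
                best_score, best_model = score, model
        return best_model

    model = best_k(X, range(2, 20))  # X from the tf-idf sketch above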
Authorship Links To produce the authorship-link scores, we simply took the formed clusters that consist of more than one member and calculated the pairwise similarity of their documents using the cosine similarity metric.
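A sketch of this scoring step is given below; the output format (a ranked list of index pairs with scores) is a simplification of the official submission format.

    from itertools import combinations
    from sklearn.metrics.pairwise import cosine_similarity

    def authorship_links(X, labels):
        """Score every document pair sharing a multi-member cluster with the
        cosine similarity of their feature vectors; labels can be the
        model.labels_ array from the previous sketch."""
        sims = cosine_similarity(X)
        links = []
        for cluster in set(labels):
            members = [i for i, c in enumerate(labels) if c == cluster]
            if len(members) < 2:
                continue  # singleton clusters produce no links
            for i, j in combinations(members, 2):
                links.append((i, j, float(sims[i, j])))
        return sorted(links, key=lambda link: link[2], reverse=True)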
2.3 Comparison Study

We ran our system on TIRA [13,4] with different feature settings on the training dataset, and the output was evaluated using the BCubed and MAP metrics. Table 1 shows the average author clustering results on the English and Dutch datasets. Overall, word embeddings perform as well as character n-grams; our hypothesis is that the word embeddings successfully captured topic differences between authors. However, word embeddings are computationally expensive: performing the task took almost 26 minutes of execution time and required more than 10Gb of memory, while character n-grams needed only 7 minutes and less than 500Mb of memory. To confirm the effectiveness of word embeddings in authorship attribution, we suggest that further experiments on a larger corpus are needed. It would also be interesting to investigate whether word embeddings still achieve good performance on a single-domain corpus. After comparing the overall performance of both features, we decided to use character n-grams in the final version of the software that was submitted for evaluation.

Table 1: Author clustering performance using different feature sets

                   character n-grams                     average word2vec
Lang  F-Bcubed  R-Bcubed  P-Bcubed  MAP        F-Bcubed  R-Bcubed  P-Bcubed  MAP
en    0.76902   0.70771   0.86441   0.02100    0.76878   0.70774   0.86708   0.03752
nl    0.80158   0.73102   0.91411   0.05950    0.79664   0.72507   0.91099   0.04678

3 Results and Discussion

We submitted our final software, which used tf-idf of character n-grams, for all three languages. Table 2 shows the results of our system on the training data. On average, our system obtained an F-BCubed score of 0.795, which indicates that it assigned the correct cluster to most of the documents. However, the system failed to perform well when evaluated using the MAP metric. Similar results were obtained on the test data, with worse MAP scores (see Table 3). Our system produced the 3rd and 4th best Mean F-Score and MAP respectively in the ranking of all PAN 2016 Author Clustering participants.

MAP is only calculated on clusters containing at least two items, so this score depends strongly on how accurately the system assigns clusters to the document collection. High scores on the BCubed metrics but low scores on MAP indicate that the system is still not good enough at capturing similarities between documents. In addition, we suspect that the nature of the corpus (which contains many single-member clusters) is the main reason why the system performs well on the BCubed metrics. Experiments on a larger corpus with more multi-member clusters would be useful to explore this hypothesis.

Table 2: Author clustering performance using character n-grams on training data

Problem     Lang  F-Bcubed  R-Bcubed  P-Bcubed  MAP
problem001  en    0.747     0.700     0.800     0.0000
problem002  en    0.662     0.537     0.863     0.0296
problem003  en    0.846     0.873     0.820     0.0209
problem004  en    0.817     0.731     0.925     0.0522
problem005  en    0.875     0.875     0.875     0.0000
problem006  en    0.668     0.530     0.903     0.0234
average     en    0.769     0.708     0.864     0.0210
problem007  nl    0.895     0.924     0.868     0.1000
problem008  nl    0.714     0.566     0.965     0.1062
problem009  nl    0.795     0.731     0.871     0.0556
problem010  nl    0.715     0.570     0.960     0.0779
problem011  nl    0.785     0.685     0.920     0.0171
problem012  nl    0.905     0.910     0.900     0.0000
average     nl    0.802     0.731     0.914     0.0595
problem013  gr    0.670     0.539     0.885     0.0439
problem014  gr    0.779     0.703     0.873     0.0005
problem015  gr    0.879     0.885     0.873     0.0625
problem016  gr    0.928     0.965     0.895     0.5556
problem017  gr    0.753     0.618     0.964     0.1434
problem018  gr    0.868     0.806     0.939     0.2859
average     gr    0.813     0.753     0.905     0.1827
overall           0.795     0.730     0.894     0.0877

Table 3: Author clustering performance using character n-grams on test data

Problem     Lang  F-Bcubed  R-Bcubed  P-Bcubed  MAP
problem001  en    0.779     0.714     0.857     0.0000
problem002  en    0.678     0.529     0.943     0.0407
problem003  en    0.885     0.914     0.857     0.0000
problem004  en    0.809     0.739     0.894     0.0104
problem005  en    0.887     0.900     0.875     0.0000
problem006  en    0.666     0.533     0.888     0.0016
average     en    0.784     0.722     0.886     0.0088
problem007  nl    0.841     0.784     0.906     0.1361
problem008  nl    0.877     0.896     0.859     0.0250
problem009  nl    0.657     0.531     0.859     0.0026
problem010  nl    0.899     0.890     0.910     0.0625
problem011  nl    0.664     0.520     0.917     0.0263
problem012  nl    0.793     0.710     0.900     0.0000
average     nl    0.789     0.722     0.892     0.0421
problem013  gr    0.808     0.743     0.886     0.0347
problem014  gr    0.708     0.569     0.936     0.0596
problem015  gr    0.871     0.886     0.857     0.0000
problem016  gr    0.843     0.786     0.909     0.0619
problem017  gr    0.899     0.929     0.871     0.0159
problem018  gr    0.749     0.621     0.943     0.2412
average     gr    0.813     0.756     0.900     0.0689
overall           0.795     0.733     0.893     0.0399

4 Conclusion

We have presented our system, which was submitted to the PAN 2016 Author Clustering task. We performed experiments using two different features: word embeddings and character n-grams. The results show that word embeddings are useful predictive features, especially for multi-topic authorship attribution; their ability to capture semantic information helps to identify the authors of the documents. However, the PAN corpus is small and we suggest that further experiments are needed. Our final submission implemented tf-idf character n-grams with K-Means clustering, and this simple approach proved to be effective for author clustering. This year's PAN shared task has encouraged us to explore these approaches on other authorship attribution tasks; we are interested in how well character n-grams perform on other problems such as large-scale or short-text authorship attribution.

5 Acknowledgment

Both authors are grateful for the efforts of the CLEF PAN 2016 team. The first author would like to acknowledge the Indonesia Endowment Fund for Education (LPDP) for support in the form of a doctoral studentship.

References

1. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. 12(4), 461–486 (Aug 2009), http://dx.doi.org/10.1007/s10791-008-9066-8
2. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (Mar 2003), http://dl.acm.org/citation.cfm?id=944919.944966
3. Frantzeskou, G., Stamatatos, E., Gritzalis, S., Katsikas, S.: Source Code Author Identification Based on N-gram Author Profiles. In: Artificial Intelligence Applications and Innovations: 3rd IFIP Conference on Artificial Intelligence Applications and Innovations (AIAI) 2006, June 7–9, 2006, Athens, Greece. pp. 508–515. Springer US, Boston, MA (2006)
4. Gollub, T., Stein, B., Burrows, S., Hoppe, D.: TIRA: Configuring, Executing, and Disseminating Information Retrieval Experiments. In: Tjoa, A., Liddle, S., Schewe, K.D., Zhou, X. (eds.) 9th International Workshop on Text-based Information Retrieval (TIR 12) at DEXA. pp. 151–155. IEEE, Los Alamitos, California (Sep 2012)
5. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (Jan 2009), http://dx.doi.org/10.1002/asi.v60:1
6. Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. Lang. Resour. Eval. 45(1), 83–94 (Mar 2011), http://dx.doi.org/10.1007/s10579-009-9111-2
7. Liu, Y., Liu, Z., Chua, T.S., Sun, M.: Topical word embeddings. In: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. pp. 2418–2424. AAAI'15, AAAI Press (2015), http://dl.acm.org/citation.cfm?id=2886521.2886657
8. Ma, C., Xu, W., Li, P., Yan, Y.: Distributional representations of words for short text classification. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 33–38. Association for Computational Linguistics, Denver, Colorado (May–June 2015), http://www.aclweb.org/anthology/W15-1505
9. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA (2008)
10. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013), http://arxiv.org/abs/1301.3781
11. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. CoRR abs/1310.4546 (2013), http://arxiv.org/abs/1310.4546
12. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
13. Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., Stein, B.: Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: Kanoulas, E., Lupu, M., Clough, P., Sanderson, M., Hall, M., Hanbury, A., Toms, E. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). pp. 268–299. Springer, Berlin Heidelberg New York (Sep 2014)
14. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/publication/884893/en
15. Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65 (1987), http://www.sciencedirect.com/science/article/pii/0377042787901257
16. Sapkota, U., Bethard, S., Montes, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 93–102. Association for Computational Linguistics, Denver, Colorado (May–June 2015), http://www.aclweb.org/anthology/N15-1010
17. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (Mar 2009), http://dx.doi.org/10.1002/asi.v60:3
18. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B., Potthast, M.: Clustering by Authorship Within and Across Documents. In: Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings. CLEF and CEUR-WS.org (Sep 2016)
19. Wang, P., Xu, J., Xu, B., Liu, C., Zhang, H., Wang, F., Hao, H.: Semantic clustering and convolutional neural network for short text categorization. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). pp. 352–357. Association for Computational Linguistics, Beijing, China (July 2015), http://www.aclweb.org/anthology/P15-2058