Towards Authorship Attribution for Bibliometrics using Stylometric Features

Andi Rexha, Stefan Klampfl, Mark Kröll and Roman Kern

{arexha, sklampfl, mkroell, rkern}@know-center.at
Know-Center GmbH, Inffeldgasse 13, A-8010 Graz (Austria)


Abstract
The overwhelming majority of scientific publications are authored by multiple persons; yet, bibliographic
metrics are only assigned to individual articles as single entities. In this paper, we aim at a more fine-grained
analysis of scientific authorship. We therefore adapt a text segmentation algorithm to identify potential author
changes within the main text of a scientific article, which we obtain by using existing PDF extraction techniques.
To capture stylistic changes in the text, we adopt a number of stylometric features. We evaluate our approach on
a small subset of PubMed articles consisting of an approximately equal number of research articles written by a
varying number of authors. Our results indicate that the more authors an article has, the more potential author
changes are identified. These results can be considered as an initial step towards a more detailed analysis of
scientific authorship, thereby extending the repertoire of bibliometrics.


Conference Topic
Methods and techniques


Introduction
Bibliometrics has had to face the ever growing amount of scientific output in recent years -- a
challenge as well as a great opportunity. Techniques from other fields such as computational
linguistics have been taken over (i) to speed up measuring processes and (ii) to
introduce novel ideas. In this paper we propose authorship attribution as an additional method for
bibliometrics. So far, authorship of a scientific article has been attributed to the given authors
in a more or less unchallenged way. The extent of authorship is in general defined by
community standards; for instance, in many scientific domains it is assumed that the lead
author did most of the (writing) work and that the last author, as head of the group,
contributed ideas. Applying authorship attribution methods enables us to attribute particular segments
of an article to individual authors thereby analysing scientific authorship on a more fine-
grained level. We would like to get more insights into writing style habits of scientists, for
instance: Is there a preferred partitioning amongst authors? Is there a relation to the author
ordering? In addition, these methods may also have the potential to measure whether the
distribution of credit within a community or a research group is just.
As a first step in this direction, we seek to identify author changes within text passages. We
thus apply TextSeqFault (Kern & Granitzer, 2009), an algorithm for intrinsic plagiarism
detection - a line of research exhibiting a closely related problem setting. The algorithm was
originally developed to detect changes in topics for the purpose of text segmentation. To be
applicable for authorship attribution, we adapted the algorithm to catch writing style changes
by taking into account stylometric features. To evaluate our approach, we created a small
subset of PubMed research articles. This data set consists of an approximately equal number
of research articles for each number of authors, ranging from one to four. In our
experiments we observed a tendency towards a correlation between the number of
authors and the stylometric differences within the text.

Background
Bibliometrics, a term coined by Alan Pritchard in 1969, in general seeks to measure science by
providing methods to explore, for example, the impact of a particular publication. Citation
analysis (cf. Garfield (1972)) is one common method: it simply counts a scientific article's
citations, which can be regarded as an indicator of the article's
scientific impact.
To face the ever growing amount of written publications, there has been an increased interest in
automating these methods by including ideas and techniques from other domains such as
computational linguistics and network analysis. To that end, linguistic resources such as the ACL
Anthology Reference Corpus (Bird et al., 2008) were compiled for standardization as well as
comparison purposes with respect to research problems including reference analysis (cf.
Peng & McCallum, 2004), citation classification (cf. Teufel, Siddharthan & Tidhar, 2006)
and generation of summaries (cf. Elkiss et al., 2008).
In this paper we introduce authorship attribution as an additional method for bibliometrics.
Authorship attribution (cf. Stamatatos (2009), Juola (2008)) is a classification setting
in which the author of a questioned article is to be selected from a set of candidate authors. This
line of research can be traced back to the 19th century, when Mendenhall (1887) aimed to
characterize the plays of Shakespeare. Later, Mosteller & Wallace (1964) used a
Bayesian approach to analyse 'The Federalist Papers'. Since then, a line of research known as
'stylometry' has focused on defining features to quantify an author's writing style (Holmes, 1998),
including (i) lexical features such as average word/sentence length and vocabulary richness,
(ii) syntactical features such as frequency of function words and use of punctuation and (iii)
structural features such as indentation. Bergsma, Post & Yarowsky (2012) used stylometric
features to detect the author's gender, whether the author is a native or non-native speaker,
and whether an article is a conference or a workshop paper.

Experimental Setup

Dataset
For the evaluation we use a dataset composed of randomly selected documents from PubMed
(http://www.ncbi.nlm.nih.gov/pubmed/), a free database created by the US National Library
of Medicine holding full-text articles from the biomedical domain together with a standard
XML mark-up that rigorously annotates the complete content of the published document, in
particular the author metadata. The documents contained in this database are very diverse. In
this work we focus on research articles only, but there is also a wide range of different article
types, including book reviews and meeting reports.
For this evaluation we selected a small subset of the PubMed dataset consisting of an
approximately equal number of research articles for each number of authors,
ranging from one to four. For our preliminary evaluation, we chose 10 research articles for
each number of authors from the BMC Bioinformatics journal (40 articles in total).

PDF Extraction
A prerequisite for the analysis of the writing style of scientific articles is the reliable
extraction of their textual content. The portable document format (PDF), the most common
format for scientific literature today, is optimised for presentation, but lacks structural
information. As the raw character stream of the PDF is usually interrupted in mid-sentence by
decorations or floating objects, extracting the main text of a scholarly article in the correct
order requires the analysis of its document structure. To solve this task we build here upon
our previous work (Klampfl et al., 2014), where we have developed an unsupervised
processing pipeline that analyses the structure of a PDF document using a number of both
supervised and unsupervised machine learning techniques and heuristics. It processes a given
PDF file in a sequence of individual processing modules and outputs the extracted body text.
The first step builds upon the output of the Apache PDFBox library (http://pdfbox.apache.org)
and uses unsupervised learning (clustering) to extract blocks of contiguous text from the raw
PDF file and their column-wise reading order on each page. We consider these text blocks as
the basic building blocks of a scientific article. In the next stage, these text blocks are
categorized into different logical labels based on their role within the document: meta-data
blocks, decorations, figure and table captions, main text, and section headings. This stage is
implemented as a sequential pipeline of detectors each of which labels a specific type of
block. Apart from the meta-data detectors, they are completely model-free and unsupervised.
For more details on each of these detectors, the interested reader is referred to Klampfl et al.
(2014). In the final stage of our PDF extraction pipeline the main body text of a scientific
article is extracted by concatenating blocks containing section headings and main text in the
reading order. Furthermore we resolve hyphenations at the end of lines and across blocks,
columns, and pages.
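
The final extraction stage can be illustrated with a minimal Python sketch. It is not the pipeline of Klampfl et al. (2014): the block representation (a label plus a list of lines), the label names and the naive rule for resolving hyphenations are assumptions made here for illustration only.

```python
# Hypothetical label names; the actual pipeline's labels are not given in the text.
KEEP_LABELS = {"heading", "main"}


def join_lines(lines):
    """Join lines into one string, merging words hyphenated at a line end.

    Naive assumption: a trailing '-' followed by a lower-case letter is a
    line-break hyphenation. Genuine compounds split at a line end (e.g.
    'fine-' / 'grained') would lose their hyphen; a dictionary check would
    be needed to tell the two cases apart.
    """
    text = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if text.endswith("-") and line[0].islower():
            text = text[:-1] + line      # drop the hyphen and glue the halves
        elif text:
            text += " " + line
        else:
            text = line
    return text


def extract_body(blocks):
    """Concatenate heading and main-text blocks, given in reading order."""
    lines = [
        line
        for label, block_lines in blocks
        if label in KEEP_LABELS
        for line in block_lines
    ]
    return join_lines(lines)


blocks = [
    ("heading", ["Introduction"]),
    ("decoration", ["Page 3"]),
    ("main", ["text segmen-"]),
    ("caption", ["Figure 1: an example"]),
    ("main", ["tation is applied"]),
]
print(extract_body(blocks))
```

Because all kept lines are joined in one pass, a hyphenation is resolved in the same way whether the two halves sit on consecutive lines or in different blocks, columns or pages.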

Text Segmentation
Our intrinsic plagiarism detection algorithm is based on a sliding window approach, originally
developed for text segmentation. Text segmentation is applied in order to reconstruct
individual document borders of a single, long document that was constructed by
concatenating multiple textual documents, e.g., transcripts of spoken text. The majority of
techniques for text segmentation are designed to detect changes in topics (Choi, 2000; Dias &
Alves, 2005). Our text segmentation algorithm (Kern & Granitzer, 2009), named
TextSeqFault, is a derivative of the well-known TextTiling algorithm, proposed by Hearst
(1997), and also falls into this category.
For each position within the document, preceding and succeeding consecutive sentences are
combined into two adjacent sliding windows, which are then compared in a vector space. A
dissimilarity measure calculates the relative difference between their inner similarity (the
average pairwise similarity of sentences within the two windows) and their outer similarity
(the average pairwise similarity of sentences across the two windows). This dissimilarity
value is positive if the outer similarity is lower than the inner similarity, which indicates a
potential topic change. The maximum value of 1 is reached if the outer similarity is zero,
which is the case if the two windows correspond to orthogonal vectors. A topic change is reported
when the dissimilarity exceeds a predefined threshold. As a similarity measure between two
sentences we chose the common cosine similarity because of its simplicity and efficiency.
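
The sliding window comparison described above can be sketched in Python as follows. This is an illustrative reading of the description, not the TextSeqFault implementation: the normalisation (inner - outer) / inner is an assumption chosen so that the value is positive when the outer similarity is lower than the inner one and reaches 1 when the outer similarity is zero, and all function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equally long feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def window_dissimilarity(left, right):
    """Relative difference between inner and outer similarity (assumed form)."""
    # Inner similarity: sentence pairs taken from within the same window.
    inner_pairs = list(combinations(left, 2)) + list(combinations(right, 2))
    inner = mean(cosine(u, v) for u, v in inner_pairs)
    # Outer similarity: sentence pairs taken across the two windows.
    outer = mean(cosine(u, v) for u in left for v in right)
    if inner == 0.0:
        return 0.0
    return (inner - outer) / inner


def dissimilarity_profile(sentence_vectors, window_size):
    """Dissimilarity at every position that has two full adjacent windows."""
    profile = []
    last = len(sentence_vectors) - window_size
    for pos in range(window_size, last + 1):
        left = sentence_vectors[pos - window_size:pos]
        right = sentence_vectors[pos:pos + window_size]
        profile.append(window_dissimilarity(left, right))
    return profile


def detect_changes(profile, threshold):
    """Indices of the profile whose dissimilarity exceeds the threshold."""
    return [i for i, d in enumerate(profile) if d > threshold]
```

For two windows of mutually orthogonal sentence vectors, e.g. `[[1, 0], [1, 0]]` and `[[0, 1], [0, 1]]`, the outer similarity is zero and the sketch returns the maximum value of 1.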

Stylometric Features
In the original TextSeqFault algorithm (Kern & Granitzer, 2009) the features used to detect a
change in topic are directly derived from the words within the sentences, i.e., by building a
vector space of unigrams. We adapted the algorithm for the domain of intrinsic plagiarism
detection by using a different set of features. Instead of topical features, such as word
unigrams or other elements carrying semantic information, we made use of stylometric
features, as we expected topical features to work only in cases where not only
the authorship, but also the whole topic of the text changes dramatically. These stylometric
features were chosen to reflect the style of the author, rather than the topic, which typically
does not change within a single scientific article. In the literature a wide array of stylometric
features has been proposed (Mosteller & Wallace, 1964; Tweedie & Baayen, 1998;
Stamatatos, 2009). Stylometric features have also been put to use in a number of use cases,
e.g. for author profiling (Koppel, Argamon & Shimoni, 2002) and vandalism detection
(Harpalani et al., 2011). Table 1 shows the stylometric features used in our algorithm.

Table 1: List of stylometric features used in our text segmentation algorithm.
Many of those features are defined in (Tweedie & Baayen, 1998)
  Feature name                 Description
  alpha-chars-ratio            the fraction of total characters in the paragraph which are letters
  digit-chars-ratio            the fraction of total characters in the paragraph which are digits
  upper-chars-ratio            the fraction of total characters in the paragraph which are upper-case
  white-chars-ratio            the fraction of total characters in the paragraph which are whitespace
                               characters
  type-token-ratio             ratio between the size of the vocabulary (i.e., the number of different
                               words) and the total number of words
  hapax-legomena               the number of words occurring once
  hapax-dislegomena            the number of words occurring twice
  yules-k                      a vocabulary richness measure defined by Yule
  simpsons-d                   a vocabulary richness measure defined by Simpson
  brunets-w                    a vocabulary richness measure defined by Brunet
  sichels-s                    a vocabulary richness measure defined by Sichel
  honores-h                    a vocabulary richness measure defined by Honoré
  average-word-length          average length of words in characters
  average-sentence-char-length average length of sentences in characters
  average-sentence-word-length average length of sentences in words
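
To make the features of Table 1 more concrete, the following Python sketch computes a subset of them for a single text window. The tokenisation (runs of letters, lower-cased) and the exact formulas for the vocabulary richness measures are assumptions based on their commonly cited definitions; the implementation used in our algorithm may differ in detail.

```python
import math
import re
from collections import Counter


def stylometric_features(text):
    """Compute a subset of the Table 1 features for one text window."""
    # Assumption: a word is a maximal run of letters, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    n = len(tokens)                      # total number of words (tokens)
    counts = Counter(tokens)
    v = len(counts)                      # vocabulary size (types)
    spectrum = Counter(counts.values())  # i -> number of types occurring i times
    v1 = spectrum.get(1, 0)              # hapax legomena
    v2 = spectrum.get(2, 0)              # hapax dislegomena
    chars = len(text)

    def ratio(a, b):
        # Division that yields 0.0 for an empty denominator.
        return a / b if b else 0.0

    sum_i2 = sum(i * i * vi for i, vi in spectrum.items())
    return {
        "alpha-chars-ratio": ratio(sum(c.isalpha() for c in text), chars),
        "digit-chars-ratio": ratio(sum(c.isdigit() for c in text), chars),
        "upper-chars-ratio": ratio(sum(c.isupper() for c in text), chars),
        "white-chars-ratio": ratio(sum(c.isspace() for c in text), chars),
        "type-token-ratio": ratio(v, n),
        "hapax-legomena": v1,
        "hapax-dislegomena": v2,
        # Yule's K = 10^4 * (sum_i i^2 V_i - N) / N^2
        "yules-k": ratio(1e4 * (sum_i2 - n), n * n),
        # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
        "simpsons-d": sum(
            vi * ratio(i, n) * ratio(i - 1, n - 1) for i, vi in spectrum.items()
        ),
        # Brunet's W = N^(V^-a), with the exponent a = 0.172 assumed here
        "brunets-w": n ** (v ** -0.172) if n else 0.0,
        # Sichel's S = V_2 / V
        "sichels-s": ratio(v2, v),
        # Honoré's H = 100 * log(N) / (1 - V_1 / V); 0.0 if every word is a hapax
        "honores-h": ratio(100 * math.log(n), 1 - ratio(v1, v)) if n else 0.0,
        "average-word-length": ratio(sum(len(t) for t in tokens), n),
    }


features = stylometric_features("the cat sat on the mat")
print(features["type-token-ratio"], features["hapax-legomena"])
```

The resulting dictionary can be turned into a fixed-order vector per sentence or window, which then takes the place of the unigram vector in the segmentation algorithm. The two sentence-length features are omitted because they require a sentence splitter.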


Evaluation
As a preliminary evaluation, we plot a visual landscape of the
dissimilarity within documents. For each of the analysed documents we calculate the
stylometric dissimilarity between two adjacent sliding windows containing thirty sentences
each. To show the results of this step on a larger scale, we multiply them by a scaling factor
of 10,000. Furthermore, we have normalized the length of the documents, so that each position
in the chart represents the dissimilarity at the corresponding relative position in the document.
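
The scaling and length normalization can be sketched as follows. Only the scaling factor of 10,000 is taken from the description above; the number of bins and the averaging within each bin are assumptions made for this example, and `normalise_profile` is a hypothetical name.

```python
def normalise_profile(profile, bins=100, scale=10_000):
    """Resample a dissimilarity profile onto `bins` relative positions.

    Each output value is the scaled mean of the profile values that fall
    into the corresponding fraction of the document, so that documents of
    different lengths can be drawn on the same horizontal axis.
    """
    if not profile:
        return [0.0] * bins
    resampled = []
    for b in range(bins):
        start = b * len(profile) // bins
        # Guarantee at least one value per bin for short documents.
        end = max(start + 1, (b + 1) * len(profile) // bins)
        chunk = profile[start:end]
        resampled.append(scale * sum(chunk) / len(chunk))
    return resampled


# Toy values, not measured data.
print(normalise_profile([0.001, 0.003], bins=4))
```

With this resampling, a document of 200 sentences and one of 600 sentences both yield a curve with the same number of points.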


Figure 1: Landscape of the writing style dissimilarity for papers with different numbers of authors
Figure 2: Comparison of writing style dissimilarity among papers with different numbers of authors


We show two types of charts: the first illustrates the style changes among papers
within the same category (with the same number of authors), and the second compares
articles with different numbers of authors in order to show a correlation between the
number of authors and the dissimilarity of the writing style.
As illustrated in Figure 1, changes of writing style tend to be more pronounced as the
number of authors grows. The number of high peaks (which represent a large change of the
writing style) increases with the number of authors of the paper.
Inspection of Figure 2 highlights the differences between papers written by different
numbers of authors. The papers with one and two authors tend to have a flat shape, showing a
small dissimilarity within the document. On the other hand, the papers with three and four
authors tend to have larger variations of writing style. On closer inspection,
the document with four authors also tends to show a higher number of large dissimilarities
than the three-author paper.
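
The comparison above is based on visual inspection. One possible way to quantify the number of high peaks is sketched below; the definition of a peak as a local maximum above a threshold and the threshold value itself are assumptions that were not part of our evaluation, and the profile values are made up for illustration.

```python
def count_peaks(profile, threshold):
    """Count local maxima of a dissimilarity profile that exceed `threshold`.

    A plateau of equal values is counted once, at its first position.
    """
    peaks = 0
    for i in range(1, len(profile) - 1):
        is_local_max = profile[i] > profile[i - 1] and profile[i] >= profile[i + 1]
        if is_local_max and profile[i] > threshold:
            peaks += 1
    return peaks


# Toy profile (scaled dissimilarities), not measured data.
toy_profile = [0, 5, 0, 12, 3, 20, 20, 1]
print(count_peaks(toy_profile, threshold=10))
```

Counting such peaks per document would allow the papers in each author-count category to be compared numerically rather than only by eye.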

Conclusion
In this paper, we proposed to add authorship attribution methods to the repertoire of
bibliometrics, thereby enabling a more fine-grained analysis of authorship. As a first step in
this direction we presented an algorithm to segment scientific articles according to writing
style changes. Our preliminary results corroborate the natural assumption that, in most cases,
the more authors contribute, the more author changes are identified. In future work, we will
extend our evaluation to more articles across topics as well as across journals. In addition, we
intend to learn classification models for individual authors that capture their respective writing
style, in order to associate each part of an article with an individual author. Such models might
be used to credit each author's contribution to the paper differently.

Acknowledgments
This work is funded by the KIRAS program of the Austrian Research Promotion Agency
(FFG) (project number 840824). The Know-Center is funded within the Austrian COMET
Program under the auspices of the Austrian Ministry of Transport, Innovation and
Technology, the Austrian Ministry of Economics and Labour and by the State of Styria.
COMET is managed by the Austrian Research Promotion Agency FFG.

References
Bergsma, S., Post, M., & Yarowsky, D. (2012). Stylometric analysis of scientific articles. Proceedings
   of the Conference of the North American Chapter of the Association for Computational Linguistics:
   Human Language Technologies.
Bird, S., Dale, R., Dorr, B., Gibson, B., Joseph, M., Kan, M., Lee, D., Powley, B., Radev, D. & Fan Tan,
   Y. (2008). The ACL anthology reference corpus: A reference dataset for bibliographic research in
   computational linguistics. Proceedings of the 6th International Conference on Language Resources
   and Evaluation Conference (LREC08), pages 1755–1759.
Choi, F.Y. (2000). Advances in domain independent linear text segmentation. Proceedings of the 1st
   North American chapter of the Association for Computational Linguistics conference. pp. 26-33.
Dias, G. & Alves, E. (2005). Unsupervised topic segmentation based on word co-occurrence and
   multi-word units for text summarization. Proceedings of the ELECTRA Workshop associated to
   28th ACM SIGIR Conference, Salvador, Brazil. pp. 41-48.
Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D. & Radev, D. (2008). Blind men and elephants:
   what do citation summaries tell us about a research article? Journal of the American Society for
   Information Science and Technology, 59(1):51–62.
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science (178).
Juola, P. (2008). Authorship attribution. Foundations and Trends in Information Retrieval, 1.
Harpalani, M., Hart, M., Singh, S., Johnson, R. & Choi, Y. (2011). Language of vandalism: Improving
   wikipedia vandalism detection via stylometric analysis. Proceedings of the 49th Annual Meeting of
   the Association for Computational Linguistics: Human Language Technologies. pp. 83-88.
Hearst, M.A. (1997). TextTiling: Segmenting text into multi-paragraph subtopic passages.
   Computational linguistics 23(1), 33-64.
Holmes, D. (1998). The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic
   Computing, 13(3):111–117.
Kern, R. & Granitzer, M. (2009). Efficient linear text segmentation based on information retrieval
   techniques. In: Proceedings of the International Conference on Management of Emergent Digital
   EcoSystems. p. 25.
Klampfl, S., Granitzer, M., Jack, K. & Kern, R. (2014). Unsupervised document structure analysis of
   digital scientific articles. International Journal on Digital Libraries 14(3-4), 83-99.
Mendenhall, T. (1887). The characteristic curves of composition. Science, ns-9(214S):237–246.
Mosteller, F. & Wallace, D. (1964). Inference and Disputed Authorship: The Federalist. Addison-
   Wesley.
Peng, F. & McCallum, A. (2004). Accurate information extraction from research papers using
   conditional random fields. Proceedings of Human Language Technology Conference / North
   American Chapter of the Association for Computational Linguistics, pages 329–336.
Stamatatos, E. (2009). A survey of modern authorship attribution methods. Journal of the American
   Society for Information Science and Technology, 60(3):538–556.
Teufel, S., Siddharthan, A. & Tidhar, D. (2006). Automatic classification of citation function. In
   Proceedings of the International Conference on Empirical Methods in Natural Language
   Processing, pages 103–110.
Tweedie, F. & Baayen, H. (1998). How variable may a constant be? Measures of lexical richness in
   perspective. Computers and the Humanities. pp. 323-352.