=Paper= {{Paper |id=Vol-1749/paper10 |storemode=property |title=Investigating the Application of Distributional Semantics to Stylometry |pdfUrl=https://ceur-ws.org/Vol-1749/paper10.pdf |volume=Vol-1749 |authors=Giulia Benotto,Emiliano Giovannetti,Simone Marchi |dblpUrl=https://dblp.org/rec/conf/clic-it/BenottoGM16 }} ==Investigating the Application of Distributional Semantics to Stylometry== https://ceur-ws.org/Vol-1749/paper10.pdf
    Investigating the Application of Distributional Semantics to Stylometry

                      Giulia Benotto, Emiliano Giovannetti, Simone Marchi
                       Istituto di Linguistica Computazionale “A. Zampolli”
                                 Consiglio Nazionale delle Ricerche
                                 Via G. Moruzzi 1, 56124, Pisa - Italy
                                {name.surname}@ilc.cnr.it


                      Abstract                         measuring some textual features, we can distin-
                                                       guish between texts written by different authors
     English. The inclusion of semantic fea-           (Stamatatos, 2009).
     tures in the stylometric analysis of liter-          One of the less investigated stylistic feature is
     ary texts appears to be poorly investigated.      the way in which authors use words from a seman-
     In this work, we experiment with the ap-          tic point of view, e.g. if they tend to use more,
     plication of Distributional Semantics to a        when dealing with polysemous words, a certain
     corpus of Italian literature to test if words     sense over the others, or senses that differ (even
     distribution can convey stylistic cues. To        slightly) from the one that’s more commonly used
     verify our hypothesis, we have set up an          (as it happens, typically, in poetry).
     Authorship Attribution experiment. In-               A possible approach to the analysis of this char-
     deed, the results we have obtained suggest        acteristic is to consider the textual contexts in
     that the style of an author can reveal itself     which certain words appear. According to Dis-
     through words distribution too.                   tributional Semantics (DS), certain aspects of the
     Italiano.      L’inclusione di caratteris-        meaning of lexical expressions depend on the dis-
     tiche semantiche nell’analisi stilomet-           tributional properties of such expressions, or bet-
     rica di testi letterari appare poco stu-          ter, on the contexts in which they are observed
     diata.     In questo lavoro, sperimenti-          (Lenci, 2008; Miller and Charles, 1991). The se-
     amo l’applicazione della Semantica Dis-           mantic properties of a word can then be defined by
     tribuzionale ad un corpus di letteratura          inspecting a significant number of linguistic con-
     italiana per verificare se la distribuzione       texts, representative of the distributional behavior
     delle parole possa fornire indizi stilis-         of such word.
     tici. Per verificare la nostra ipotesi, abbi-        In this work we would like to investigate if the
     amo imbastito un esperimento di Author-           analysis of the distribution of words in a text can
     ship Attribution. I risultati ottenuti sug-       be exploited to provide a stylistic cue. In order to
     geriscono che, effettivamente, lo stile di un     inspect that, we have experimented with the ap-
     autore pu rivelarsi anche attraverso la dis-      plication of DS to the stylometric analysis of liter-
     tribuzione delle parole.                          ary texts belonging to a corpus constituted by texts
                                                       pertaining to the work of six Italian writers of the
                                                       late nineteenth century.
1    Introduction                                         In the following, Section 2 gives a short in-
Stylometry, that is the application of the study of    sight on the state of the art of computational stylis-
linguistic style, offers a means of capturing the      tic analysis, Section 3 describes the approach to-
elusive character of an author’s style by quanti-      gether with the corpus used to conduct our inves-
fying some of its features. The basic stylometric      tigation and Section 4 discuss about results. Fi-
assumption is that each writer has certain stylis-     nally, Section 5 draws some conclusions and out-
tic idiosyncrasies (a “human stylome” (Van Hal-        lines some possible future works.
teren et al., 2005)) that define their style. Analy-
                                                       2   State of the Art
sis based on stylometry are often used for Author-
ship Attribution (AA) tasks, since the main idea       The very first attempts to analyze the style of an
behind computationally supported AA is that by         author were based on simple lexical features such
as sentence length counts and word length counts,       (Herbelot, 2015) it is argued how distributionalism
since they can be applied to any language and           can support the notion that the meaning of poetry
any corpus with no additional requirements (Kop-        comes from the meaning of ordinary language and
pel and Schler, 2004; Stamatatos, 2006; Zhao and        how distributional representations can model the
Zobel, 2005; Argamon et al., 2007). Similarly,          link between ordinary and poetic language. How-
character measures have been proven to be quite         ever, the role of DS in the study of a style of an
useful to quantify the writing style (Grieve, 2007;     author was not the aim of these works.
De Vel et al., 2001; Zheng et al., 2006). Basi-
cally, a text can be viewed as a mere sequence of       3       Experimental Setup
characters, so that various measures can be defined     First, we want to specify that it is not our purpose
(including alphabetic, digit, uppercase and low-        to propose new ways to improve state-of-the-art
ercase characters count, etc.). A more elaborate        AA algorithms. Indeed, our aim is just to verify
text representation method is to employ syntac-         the hypothesis that the distribution of words can
tic information (Gamon, 2004; Stamatatos et al.,        provide an indication of a distributional stylistic
2000; Stamatatos et al., 2001; Hirst and Feiguina,      fingerprint of an author. To do this, we have set up
2007; Uzuner and Katz, 2005). The idea is that          a simple classification task. Subsection 3.1 briefly
authors tend to use similar syntactic patterns un-      depicts the data set we used, while Section 3.2 de-
consciously. Therefore, syntactic information is        scribes the steps implemented in our experiment.
considered a more reliable authorial fingerprint in
comparison to lexical information.                      3.1       Data Set Construction
   More complicated tasks such as full syntactic        In order to build the reference and test corpora, we
parsing, semantic analysis, or pragmatic analy-         started from texts pertaining to the work of six Ital-
sis cannot yet be handled adequately by current         ian writers working at the turn of the 20th century,
NLP technologies for unrestricted text. As a re-        namely, Luigi Capuana, Federico De Roberto,
sult, very few attempts have been made to exploit       Luigi Pirandello, Italo Svevo, Federigo Tozzi and
high-level features for stylometric purposes. Per-      Giovanni Verga. We chose contiguous authors in
haps the most important method of exploiting se-        chronological sense, whose texts are available in
mantic information so far was described in (Arga-       digital format (in fact we could not do a similar
mon et al., 2007). This work was based on the the-      survey on the narrative of the 90s because it is still
ory of Systemic Functional Grammar (SFG) (Hal-          under copyrights). Indeed, we used texts freely
liday, 1994) and consisted on the definition of a set   available for download from the digital library of
of functional features that associate certain words     the Manunzio project, via the LiberLiber website1 .
or phrases with semantic information.                   Since they were encoded in various formats, such
   The previously described features are applica-       as .epub, .odt and .txt, our pre-processing con-
tion independent since they can be extracted from       sisted in converting them all in .txt format and get-
any textual data. Beyond that, one can define           ting rid of all xml tags, together with footnotes and
application-specific measures in order to better        editors’ notes and comments.
represent the nuances of style in a given text do-
                                                        3.2       Experiment Description
main (such as e-mail messages, or online forum
messages) (Li et al., 2006; Teng et al., 2004).         According to Rudman (1997), a striking problem
                                                        in stylometry is due to the lack of homogeneity
   To the best of our knowledge, the application of
                                                        of the examined corpora, in particular to the im-
DS to the analysis of literary texts has been docu-
                                                        proper selection or fragmentation of the texts, that
mented in a rather small number of works (Buite-
                                                        might cause alterations in the writers’ style. In or-
laar et al., 2014; Herbelot, 2015). In both these
                                                        der to create balanced reference corpora, i.e. cov-
works, DS is used as a theoretical basis in order
                                                        ering all the authors’ different stylistic and the-
to verify some hypotheses on specific semantic
                                                        matic phases, for each author, as shown in Fig-
characteristics of poetic works. In more details,
                                                        ure 1, we built a reference corpus as the compo-
in (Buitelaar et al., 2014) the authors investigated
                                                        sition of the 70% of each single work (usually a
through DS the influence of Lord Byron’s work
                                                        novel). The same technique was used to create the
on Thomas Moore trying to find a shared vocab-
                                                            1
ulary or specific formal textual characteristics. In            http://www.liberliber.it/
                                                                 tween the vectors representing the two words of
                                                                 each pair. This allowed us to evaluate the seman-
                                                                 tic relatedness between the words by assessing
                                                                 their proximity in the distributional space as rep-
                                                                 resented by the cosine value: the more this value
                                                                 tends to 1, the more the two words of the pair are
                                                                 considered to be related. We then obtained two
                                                                 related word pair (RWP) lists for each author A:
                                                                 RWPref A and RWPtest A . Figure 1 shows the pro-
                                                                 cess described above.
                                                                    Since we wanted to focus on the analysis of the
                                                                 semantic distribution of words, we decided to ex-
                                                                 clude any possible “lexical bias”. For this reason,
Figure 1: RWPref and RWPtest creation process                    we restricted the analysis on a common vocabu-
for an author.                                                   lary, i.e. a vocabulary constituted by the inter-
                                                                 section of the six authors’ vocabularies. In this
                                                                 way, we prevent our classifier to exploit, as a fea-
test corpus by using the remaining 30% of each
                                                                 ture, the presence of words used by some (but not
work. Typical AA approaches consist in analyzing
                                                                 all) of the authors. Moreover, we removed from
known authors and assigning authorship to previ-
                                                                 the RWPtest lists all those pairs of words occurring
ously unseen text on the basis of various features.
                                                                 frequently together in the same context, since they
Train and test sets should then contain different
                                                                 might constitute a multiword expression that, once
texts. Contrary to the classical AA task, our train
                                                                 again, could be pertaining with the signature lex-
and test sets contain different parts of the same
                                                                 icon of each author. To remove them, we com-
texts. Indeed, with this experiment, we wanted to
                                                                 puted the number of times (#co-occ in Table 1)
understand if the semantics that an author bestows
                                                                 they appeared together in the context window, as
to a word, is peculiar to his writing. To prove this,
                                                                 well as their total number of occurrences (#occa
we wanted to cover all the different stylistic and
                                                                 and #occb ) and we excluded from the analysis
thematic phases an author can go through during
                                                                 those pairs for which the ratio between the num-
his activity, hence the partition of all his texts in a
                                                                 ber of co-occurrences and the total occurrences of
reference and a test portion.
                                                                 the less frequent word was higher than the empir-
   We then analyzed each reference and test cor-                 ically set threshold of 0.5. The first two pairs of
pora with a Part-of-Speech (PoS) tagger and a lem-               Table 1 would be removed as probable multiword
matizer for Italian (Dell’Orletta et al., 2014). For             (PM column in Table 1): “scoppio” (burst) and
every author, we built two lists of word pairs (with             “risa” (laughter) could mostly co-occur in “scop-
their lemma and PoS), one relative to the tagged                 pio di risa” (meaning “burst of laughter”) and the
reference corpus (reference pairs) and the other to              words “man” and “mano” (both meaning “hand”)
the tagged test set (test pairs), where each word                could mostly co-occur in “man mano” (meaning
was paired with all the other words with the same                “little by little”, or “progressively”).
PoS. We also filtered the pairs to leave only nouns,
adjectives and verbs. Starting from the tagged cor-                Wa          Wb           #occa   #occb   #co-occ   ratio   PM
pora, we built two words-by-words matrixes2 of                     scoppio–s   risa–s         19        9        7    0.78    yes
co-occurrence counts (co-occurrence matrixes) for                  man–n       mano–n         50    1325        47    0.94    yes
each author, using a context window of 43 . The                    nausea–n    disgusto–n     27      26         0       0    no

chosen DS model (Baroni and Lenci, 2010) was                       piccolo–a   grande–a      248     237        14    0.06    no

applied to each matrix to calculate the cosine be-
                                                                 Table 1: An example of co-occurring RWs from
   2
      Being the corpus relatively small and not having partic-   Pirandello’s test list: the first two pairs would be
ular computability issues, we chose not to apply decomposi-      removed.
tion techniques to reduce the size of the matrixes (and thus
not losing any information).                                       Finally, we reduced the size of the six RWPref
    3
      We performed different empiric setup of the window’s
size and chose the one that showed more suitable results, ac-    and RWPtest lists by sorting them in decreasing or-
cording to what is stated by Kruszewski and Baroni (2014).       der of the cosine value and then by keeping the
pairs with the highest cosine, selected using a per-                                          0.5%              1%                2%                5%

centage parameter θ as threshold4 . We chose to                            Capuana         Capuana           Capuana           Capuana         Capuana

introduce the parameter θ for two reasons: i) to                           De Roberto      De Roberto        De Roberto        De Roberto      De Roberto
avoid the classification algorithm to be disturbed                         Pirandello      Pirandello        Pirandello        Pirandello      Pirandello
by noisy (i.e. not significative) pairs which would                        Svevo           Svevo             Svevo             Svevo           Svevo
not hold any relevant stylistic cue, and ii) to ease
                                                                           Tozzi           Verga             Verga             Tozzi/Verga     Tozzi
a literary scholar in the interpretation of the re-
                                                                           Verga           Verga             Verga             Verga           Verga
sults by having to analyze just a limited selection
of (potentially) semantically related word pairs.
                                                                       Table 3: Results of the classification. Classifica-
   For the last phase of our experiment we defined
                                                                       tion errors are highlighted.
a classification algorithm to test the effective pres-
ence of stylistic cues inside the obtained RWPtest                                                           0.5%         1%       2%        5%
lists. We defined a classifier using a nearest-cosine                                #RWPtest   Capuana
                                                                                                               678        1357    2714       6785
method to attribute each test list to an author.                                                De Roberto
                                                                                     #RWPtest                  488         977    1954       4886
The method consisted in searching for a pair of
                                                                                     #RWPtest Pirandello       692        1385    2770       6925
words contained in the test list inside each refer-
                                                                                                Svevo
                                                                                     #RWPtest                  425         851    1702       4257
ence list and incrementing by 1 the score of the
                                                                                                Tozzi
author whose reference list included the pair with                                   #RWPtest                  246         493     986       2466

the more similar cosine value (i.e. having the min-                                  #RWPtest   Verga
                                                                                                               526        1053    2106       5267
imum difference): the chosen author was the one
with the highest score. Table 2 shows the classifi-                    Table 4: Cardinality of RWPtest for each author
cation results for θ = 5%.                                             and for each θ value.

              Capuana    De       Pirandello   Svevo   Tozzi   Verga
                                                                       test list (RWPtext Tozzi ) as shown in Table 4. It is
                        Roberto

 Capuana       1884      1269       1321       797     755     1054
                                                                       apparent that increasing the value of θ and con-
 De Roberto     729      1041        712       498     451      579
                                                                       sequently the number of significant RW pairs that
 Pirandello    1387      1278       2114       937     747     1056
                                                                       are analysed, the system is able to correctly clas-
 Svevo          353      371         341       593     372      356
                                                                       sify RWPtest Tozzi (see the values in Tozzi’s row of
 Tozzi          199      219         183       242     281      244
                                                                       Table 3).
 Verga          650      671         656       473     430      851
                                                                       5      Conclusion and Next Steps
Table 2: Classification results, obtained via the                      In this paper we investigated the possibility that
nearest-cosine method for θ = 5%.                                      an analysis of the semantic distribution of words
                                                                       in a text can be potentially exploited to get cues
                                                                       about the style of an author. In order to vali-
4       Interpreting the Results                                       date our hypothesis, we conducted a first experi-
As summarized in Table 3, a correct classification                     ment on six different Italian authors. The results
of all RWPs in RWPtest lists has been obtained                         seem to suggest that the way words are distributed
with a θ value of 5%.                                                  across a text, can provide a valid stylistic cue to
   To help in interpreting the failure of the algo-                    distinguish an author’s work. Of course, it is not
rithm in classifying Tozzi’s test list for θ values                    our intent, with this paper, to define new methods
lower than 5% (as shown in Table 3) we calculated                      for enhancing state-of-the-art authorship attribu-
the cardinality of the RWPtest lists for each author                   tion algorithms. Our research will focus, in the
with the change in θ value (Tables 4).                                 next steps, in detecting and providing useful indi-
   It is possible to observe how the choice of θ in-                   cations about the style of an author. This can be
fluences the correct classification of Tozzi’s test                    done by highlighting, for example, atypical dis-
list. Indeed, the use of a θ value below 5% has                        tributions of words (e.g. with contrastive meth-
the effect of remarkably reducing an already small                     ods) or by analysing their distributional variability.
                                                                       Furthermore, it could be interesting to use a differ-
    4
     At the following url we have uploaded an archive con-             ent distributional measure, than the cosine, to test
taining all the data we have used and processed for our exper-         our hypothesis.
iment: https://goo.gl/nrTqWh
References                                                 Jiexun Li, Rong Zheng, and Hsinchun Chen. 2006.
                                                              From fingerprint to writeprint. Communications of
Shlomo Argamon, Casey Whitelaw, Paul Chase, Sob-              the ACM, 49(4):76–82.
  han Raj Hota, Navendu Garg, and Shlomo Levitan.
  2007. Stylistic text classification using functional     George A Miller and Walter G Charles. 1991. Contex-
  lexical features. Journal of the American Society          tual correlates of semantic similarity. Language and
  for Information Science and Technology, 58(6):802–         cognitive processes, 6(1):1–28.
  822, April.
                                                           Joseph Rudman. 1997. The state of authorship attri-
Marco Baroni and Alessandro Lenci. 2010. Dis-                 bution studies: Some problems and solutions. Com-
 tributional memory: A general framework for                  puters and the Humanities, 31(4):351–365.
 corpus-based semantics. Computational Linguis-
 tics, 36(4):673–721.                                      Efstathios Stamatatos, Nikos Fakotakis, and George
                                                             Kokkinakis. 2000. Automatic text categorization in
Paul Buitelaar, Nitish Aggarwal, and Justin Tonra.           terms of genre and author. Computational linguis-
  2014. Using distributional semantics to trace influ-       tics, 26(4):471–495.
  ence and imitation in romantic orientalist poetry. In
  AHA!-Workshop 2014 on Information Discovery in           Efstathios Stamatatos, Nikos Fakotakis, and Georgios
  Text. ACL.                                                 Kokkinakis. 2001. Computer-based authorship at-
                                                             tribution without lexical measures. Computers and
Olivier De Vel, Alison Anderson, Malcolm Corney, and         the Humanities, 35(2):193–214.
  George Mohay. 2001. Mining e-mail content for au-
  thor identification forensics. ACM Sigmod Record,        Efstathios Stamatatos. 2006. Authorship attribution
  30(4):55–64.                                               based on feature set subspacing ensembles. In-
                                                             ternational Journal on Artificial Intelligence Tools,
Felice Dell’Orletta, Giulia Venturi, Andrea Cimino,          15(05):823–838.
  and Simonetta Montemagni. 2014. T2kˆ 2: a
                                                           Efstathios Stamatatos. 2009. A survey of modern au-
  system for automatically extracting and organizing
                                                             thorship attribution methods. J. Am. Soc. Inf. Sci.
  knowledge from texts. In LREC, pages 2062–2070.
                                                             Technol., 60(3):538–556, March.
Michael Gamon. 2004. Linguistic correlates of style:       Gui-Fa Teng, Mao-Sheng Lai, Jian-Bin Ma, and Ying
  authorship classification with deep linguistic analy-      Li. 2004. E-mail authorship mining based on svm
  sis features. In Proceedings of the 20th international     for computer forensic. In Machine Learning and
  conference on Computational Linguistics, page 611.         Cybernetics, 2004. Proceedings of 2004 Interna-
  Association for Computational Linguistics.                 tional Conference on, volume 2, pages 1204–1207.
                                                             IEEE.
Jack Grieve. 2007. Quantitative Authorship Attribu-
   tion: An Evaluation of Techniques. Literary and         Özlem Uzuner and Boris Katz. 2005. A comparative
   Linguistic Computing, 22(3):251–270, May.                  study of language models for book and author recog-
                                                              nition. In Natural Language Processing–IJCNLP
Michael AK Halliday. 1994. Functional grammar.                2005, pages 969–980. Springer.
  London: Edward Arnold.
                                                           Hans Van Halteren, Harald Baayen, Fiona Tweedie,
Aurélie Herbelot. 2015. The semantics of poetry: A          Marco Haverkort, and Anneke Neijt. 2005. New
  distributional reading. Digital Scholarship in the         machine learning methods demonstrate the exis-
  Humanities, 30(4):516–531.                                 tence of a human stylome. Journal of Quantitative
                                                             Linguistics, 12(1):65–77.
Graeme Hirst and Olga Feiguina. 2007. Bigrams of
  Syntactic Labels for Authorship Discrimination of        Ying Zhao and Justin Zobel. 2005. Effective and scal-
  Short Texts. Literary and Linguistic Computing,            able authorship attribution using function words. In
  22(4):405–417, September.                                  Information Retrieval Technology, pages 174–189.
                                                             Springer.
Moshe Koppel and Jonathan Schler. 2004. Authorship
 verification as a one-class classification problem. In    Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan
 Proceedings of the twenty-first international confer-       Huang. 2006. A framework for authorship identi-
 ence on Machine learning, page 62. ACM.                     fication of online messages: Writing-style features
                                                             and classification techniques. Journal of the Ameri-
Germán Kruszewski and Marco Baroni. 2014. Dead              can Society for Information Science and Technology,
  parrots make bad pets: Exploring modifier effects in       57(3):378–393, February.
  noun phrases. Lexical and Computational Seman-
  tics (* SEM 2014), page 171.

Alessandro Lenci. 2008. Distributional semantics in
  linguistic and cognitive research. Italian journal of
  linguistics, 20(1):1–31.