=Paper=
{{Paper
|id=Vol-1410/paper5
|storemode=property
|title=A Peculiarity-based Exploration of Syntactical Patterns: a Computational
Study of Stylistics
|pdfUrl=https://ceur-ws.org/Vol-1410/paper5.pdf
|volume=Vol-1410
|dblpUrl=https://dblp.org/rec/conf/pkdd/BoukhaledFG15
}}
==A Peculiarity-based Exploration of Syntactical Patterns: a Computational Study of Stylistics==
Mohamed-Amine Boukhaled, Francesca Frontini, Jean-Gabriel Ganascia
LIP6 (Laboratoire d’Informatique de Paris 6), Université Pierre et Marie Curie and CNRS
(UMR7606), ACASA Team, 4, place Jussieu,
75252-PARIS Cedex 05 (France)
{mohamed.boukhaled, francesca.frontini, jean-gabriel.ganascia}@lip6.fr
Abstract. In this contribution, we present a computational stylistic
study and comparison of classic French literary texts based on a
data-driven approach in which interesting linguistic patterns are
discovered without any prior knowledge. We propose an objective measure
capable of capturing and extracting meaningful stylistic syntactic patterns
from a given author’s work. Our hypothesis is that the most relevant
syntactic patterns should significantly reflect the author’s stylistic
choices and should therefore be peculiarly overrepresented, as a result of
the author’s purpose, with respect to a linguistic norm. The results show
that the method is effective at extracting interesting syntactic patterns
from novels and seems particularly promising for the analysis of such texts.
Keywords: Computational Stylistics, Interestingness Measure, Sequential
Pattern Mining, Syntactic Style
1 Introduction
Computational stylistics is a subdomain of computational linguistics located
at the intersection of several research areas such as natural language
processing, literary analysis and data mining. Its goal is to extract style
patterns characterizing a particular type of text using computational and
automatic methods (Craig 2004). When investigating the writing style of a
particular author, the task is to automatically explore the linguistic forms
of that style, which comprise not only distinguishing features but also the
deliberate overuse of certain structures by the author compared to a
linguistic norm (Mahlberg 2012). However, the notion of style in
computational stylistics is broad and manifests itself on several linguistic
levels: lexicon, syntax, semantics and pragmatics. Each level has its own
style markers and its own linguistic units that characterize it.
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of
DMNLP, Workshop at ECML/PKDD, Nancy, France, 2014.
Copyright © by the paper’s authors. Copying only for private and academic purposes.
Many studies have analyzed stylistic traits on these different linguistic
levels (Biber 2006, Biber & Conrad 2009, Ramsay 2011, Frontini et al. 2014;
see Siemens & Schreibman 2013 for a discussion and overview). In this
contribution, we target syntactic style.
Quiniou et al. (2012) demonstrated the value of sequential data mining
methods for the stylistic analysis of large texts. They showed that relevant
and understandable patterns characteristic of a specific type of text can be
extracted using techniques such as sequential pattern mining.
However, extracting textual patterns is known to produce a large number of
patterns, even from a relatively small sample of text. An interestingness
measure must therefore be applied to identify the patterns that are most
important and relevant for characterizing the style of the text in question.
In this paper, we present a computational stylistic study of classic texts of
French literature based on a data-driven approach in which interesting
linguistic forms are discovered without any prior knowledge. Specifically,
the proposed method assesses the peculiar over-representation, with respect
to a norm corpus, of syntactic patterns extracted from texts using a
sequential data mining technique. The method is intended to support textual
analysis quantitatively by measuring the degree of importance of each
syntactic pattern (a syntagmatic segment with potential gaps) and by
extracting the syntactic patterns that characterize the syntactic style of a
work by a particular author.
2 Approach for extracting relevant syntactic patterns
Our method consists of two steps. First, a sequential pattern mining
algorithm is applied to the texts in order to extract recurrent syntactic
patterns. Second, a peculiarity-based interestingness measure that evaluates
overrepresentation (in terms of frequency of occurrence with respect to a
norm corpus) is applied to the set of extracted syntactic patterns. Each
syntactic pattern is thus assigned an interestingness value indicating its
importance and relevance for characterizing the syntactic style of the text.
In what follows, Section 2.1 presents the corpus used in our experiment and
the protocol for dividing it into two parts: the text to analyze and the text
used as norm. Section 2.2 then introduces the elements needed to understand
the extraction of sequential syntactic patterns. Finally, Section 2.3
presents the formulation and statistical details of the proposed
interestingness measure.
2.1 Analyzed Corpus
In our study, we used four novels belonging to the same genre and the same
literary period, written by four famous classic French authors: Balzac’s
“Eugénie Grandet”, Flaubert’s “Madame Bovary”, Hugo’s “Notre-Dame de
Paris” and Zola’s “Le Ventre de Paris”. This choice is motivated by our
particular interest in studying the style of classic French literature of the
19th century. When the syntactic patterns are analyzed, each text written by
one of the four authors is contrasted with the texts written by the three
other authors. That is, these three texts serve as the norm corpus against
which we evaluate the hypothesis that syntactic patterns are overrepresented
in the fourth, remaining text, as explained later in this section.
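The contrast protocol described above can be sketched as a leave-one-out split. This is only an illustrative sketch: the function name `leave_one_out`, the representation of a novel as a list of POS-tag sequences, and the toy sentences are assumptions, not part of the original study.

```python
from typing import Dict, List, Tuple

Sentence = List[str]  # one sentence as a sequence of POS tags


def leave_one_out(
    corpus: Dict[str, List[Sentence]],
) -> Dict[str, Tuple[List[Sentence], List[Sentence]]]:
    """For each novel, pair its sentences with a norm corpus made of
    the sentences of all the other novels."""
    splits = {}
    for title, sentences in corpus.items():
        norm = [
            sentence
            for other, other_sentences in corpus.items()
            if other != title
            for sentence in other_sentences
        ]
        splits[title] = (sentences, norm)
    return splits


# Toy stand-ins for the four tagged novels (invented sentences).
corpus = {
    "Eugénie Grandet": [["DET", "NOM"]],
    "Madame Bovary": [["PRO", "VER"]],
    "Notre-Dame de Paris": [["NOM", "VER"], ["DET", "ADJ"]],
    "Le Ventre de Paris": [["KON", "NOM"]],
}
splits = leave_one_out(corpus)
text, norm = splits["Madame Bovary"]
```

Each novel thus plays the role of the analyzed text once and contributes to the norm corpus of the three other analyses.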
2.2 Extraction of syntactic patterns
In our study we adopt a syntagmatic approach. The text is first segmented
into a set of sentences; each sentence is then represented by the sequence of
syntactic labels (POS tags)¹ corresponding to its words, obtained using
TreeTagger (Schmid 1994). This produces a set of syntactic sequences for each
text. For example, the sentence “Le silence profond régnait nuit et jour
dans la maison.” is represented by the sequence:
< DET, NOM, ADJ, VER, NOM, KON, NOM, PRP, DET, NOM, SENT >
Then, sequential patterns of a certain length, together with their supports
(the number of sentences containing the pattern), are extracted from this
syntactic sequential database using a sequential pattern extraction algorithm
(Viger et al. 2014). A syntactic pattern consists of a sequential syntagmatic
segment (with possible gaps) present in the syntactic sequences. It can be
considered a generalization of the notion of n-gram widely used in the field
of natural language processing. Examples of syntactic patterns present in the
sequence above:
• <DET><NOM><ADJ>
• <NOM><ADJ><VER><NOM>
• <KON><NOM><*>² <DET><NOM>
To avoid the effect of statistical fluctuations on the analysis of patterns
with low support, we applied a support threshold of 1%. That is, we focus
only on patterns that are present in at least 1% of the sentences of the
analyzed text. However, as sequential pattern mining is known to produce a
large quantity of patterns even from relatively small samples of text, an
interestingness measure should be applied to these patterns in order to
identify the most important ones. This interestingness measure is explained
in the next section.
¹ French TreeTagger tagset:
http://www.cis.unimuenchen.de/~schmid/tools/TreeTagger/data/french-tagset.html
² <*> denotes a gap that can be filled with any POS tag
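To make the notions of support and support threshold concrete, here is a minimal sketch. It is not the mining algorithm used in the study (which relies on the SPMF library); it only checks given candidate patterns against a toy database. The helper names, the `*` wildcard encoding, and the assumption that a gap fills exactly one position in a contiguous segment are illustrative choices. The tag names follow the French TreeTagger tagset.

```python
from typing import List, Sequence

GAP = "*"  # assumed encoding of a gap: matches exactly one POS tag


def contains(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` occurs in `sequence` as a contiguous segment,
    where GAP matches any single tag."""
    n, m = len(sequence), len(pattern)
    for start in range(n - m + 1):
        window = sequence[start:start + m]
        if all(p == GAP or p == tag for p, tag in zip(pattern, window)):
            return True
    return False


def relative_support(database: List[Sequence[str]],
                     pattern: Sequence[str]) -> float:
    """Share of sentences in `database` that contain `pattern`."""
    if not database:
        return 0.0
    hits = sum(contains(sequence, pattern) for sequence in database)
    return hits / len(database)


def frequent_patterns(database, candidates, min_support=0.01):
    """Keep the candidates present in at least `min_support` of the sentences."""
    return [p for p in candidates
            if relative_support(database, p) >= min_support]


# Tag sequence of "Le silence profond régnait nuit et jour dans la maison."
example = ["DET", "NOM", "ADJ", "VER", "NOM", "KON",
           "NOM", "PRP", "DET", "NOM", "SENT"]
database = [example, ["PRO", "VER", "SENT"]]  # second sentence is invented

gapped = ["KON", "NOM", GAP, "DET", "NOM"]
kept = frequent_patterns(database, [gapped, ["ADJ", "NOM"]])
```

In this toy database the gapped pattern is found in one of the two sentences, so it passes the 1% threshold, while `["ADJ", "NOM"]` occurs in neither and is discarded.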
2.3 Evaluation of the relevance of syntactic patterns
Our hypothesis for evaluating the relevance of a syntactic pattern is that
the most relevant patterns should significantly reflect the stylistic choices
of the author and should thus be characterized by a significantly peculiar
quantitative behavior. This peculiar behavior translates into an
over-representation of the pattern’s support in the author’s texts.
However, this overrepresentation cannot be captured by the absolute frequency
of occurrence (support) alone. Indeed, more frequent use of a syntactic
pattern by an author (which translates into a relatively high support) does
not necessarily indicate a stylistic choice, since it may well be a property
imposed by the grammar of the language or by syntactic features that are
characteristic of the text’s genre.
Thus, to assess the over-representation of a pattern, we use an empirical
approach based on comparing the support of a syntactic pattern in a text with
its support in a norm corpus. A ratio α between these two quantities is
calculated as follows:
$$\alpha = \frac{\text{frequency of the pattern in the norm corpus}}{\text{frequency of the pattern in the text}}$$
In our experiments we found empirically that the distribution of the ratio α
exhibits Gaussian behavior: the values of α are normally distributed around a
central value (see Fig. 1). This is because the frequency of occurrence of a
syntactic pattern in a text is highly correlated with its frequency of
occurrence in the norm corpus, with a few exceptional cases, or outliers (see
Fig. 2). These outliers are the patterns of special interest for our study
because they represent a linguistic deviation that is specific to the
author’s style compared to what one would expect to see in the norm corpus.
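The ratio defined above can be computed as in the following sketch. The use of relative frequencies (share of sentences), the function name `alpha_ratio`, and all numeric values are assumptions made for illustration; they are not results from the study.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def alpha_ratio(freq_norm: float, freq_text: float) -> float:
    """Frequency of a pattern in the norm corpus divided by its frequency
    in the analyzed text; low values indicate over-representation."""
    if freq_text <= 0:
        raise ValueError("the pattern must occur in the analyzed text")
    return freq_norm / freq_text


# Hypothetical relative frequencies as (norm corpus, analyzed text) pairs.
# These numbers are invented for the example.
frequencies: Dict[Pattern, Tuple[float, float]] = {
    ("DET", "NOM", "ADJ"): (0.050, 0.048),
    ("KON", "PUN"): (0.010, 0.040),
}

alphas = {
    pattern: alpha_ratio(norm, text)
    for pattern, (norm, text) in frequencies.items()
}
```

With these invented values, the first pattern has a ratio close to 1 (similar use in both corpora), while the second has a ratio of about 0.25, meaning it is roughly four times more frequent in the analyzed text than in the norm corpus.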
Fig. 1. Gaussian behaviour of the ratio α in Balzac’s “Eugénie Grandet” novel
The configuration described above allows us to use an outlier detection
method based on the Gaussian distribution and the z-score to identify such
special patterns (Chandola et al. 2009). Because the ratio has the frequency
in the text as its denominator, an over-represented pattern has an unusually
low ratio and hence a strongly negative z-score. The most over-represented
patterns are therefore those associated with the lowest values of the
standard z-score. The z-score values are calculated as follows:
$$z_i = \frac{\alpha_i - \bar{\alpha}}{\sigma}$$
where α_i and z_i are respectively the ratio α and the z-score corresponding
to the i-th syntactic pattern, and ᾱ and σ are respectively the mean and
standard deviation of the ratio α.
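The z-score step can be sketched as follows. The cut-off of -2, the use of the population standard deviation, and the toy ratio values are assumptions; the text above does not specify a threshold or an estimator.

```python
from statistics import mean, pstdev
from typing import Dict, List


def z_scores(alphas: Dict[str, float]) -> Dict[str, float]:
    """Standardize each ratio against the mean and standard deviation
    of all ratios (population standard deviation, by assumption)."""
    values = list(alphas.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return {pattern: 0.0 for pattern in alphas}
    return {pattern: (a - mu) / sigma for pattern, a in alphas.items()}


def over_represented(alphas: Dict[str, float],
                     threshold: float = -2.0) -> List[str]:
    """Patterns whose z-score is at or below `threshold`,
    most over-represented first."""
    scores = z_scores(alphas)
    flagged = [p for p, z in scores.items() if z <= threshold]
    return sorted(flagged, key=scores.get)


# Invented ratios: most patterns sit near 1, one is far below.
toy_alphas = {"p1": 1.0, "p2": 1.1, "p3": 0.9, "p4": 1.0,
              "p5": 1.05, "p6": 0.95, "p7": 0.2}
outliers = over_represented(toy_alphas)
```

Here only `p7` is flagged, because its ratio lies far below the central value of the other patterns.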
Fig. 2. Frequencies of syntactic patterns in a text with respect to their frequen-
cies in the norm corpus for the studied novel. Each point in the graph represents
a syntactic pattern. The plotted lines represent the linear regression lines captur-
ing the expected behaviour of the α ratio
3 Results and Discussion
In this section, we present some examples of relevant syntactic patterns
extracted from our corpus. The patterns extracted with the proposed method
appear highly relevant for characterizing the style of the authors in our
corpus, but also the content of the novels and the literary genre to which
they belong. In Flaubert’s Madame Bovary, several extracted patterns reflect
the rhythmic rather than functional role of punctuation that is peculiar to
Flaubert’s style (Mangiapane 2012). For example, pattern (1) captures
instances of a comma preceding a conjunction, followed by a parenthetical
clause.
Pattern (1) < KON>< PUN> , with support= 113,
sample instances of the pattern in the text:
• , et , à
• , mais , avant
• ; et , à
In Zola’s Le Ventre de Paris, similarly, the syntactic patterns extracted as
relevant clearly reflect the use of nested clauses to describe situations or
attitudes in the novel, as in pattern (2), or to describe public places and
objects on display in long lists, as in pattern (3):
Pattern (2) : , support= 104, sample
instances of the pattern in the text (bold text):
« Florent se heurtait à mille obstacles , à des porteurs qui se chargeaient , à
des marchandes qui discutaient de leurs voix rudes ; il glissait sur le lit épais d'
épluchures et de trognons qui couvrait la chaussée , il étouffait dans l' odeur puissante
des feuilles écrasées .»
Pattern (3): , support= 68,
sample instances of the pattern in the text (bold text):
• angles , à fenêtres étroites
• très-jolies , des légendes miraculeuses
• écrevisses , des nappes mouvantes
In Balzac’s Eugénie Grandet, the syntactic patterns and their textual
instances perform other communicative functions, for example:
Pattern (4): , support= 49, which is
used as a post-introducer of direct speech. This rather formulaic way of
specifying (in parenthetical form) the speaker of reported speech is common
to all four authors, but seems to be strongly preferred by Balzac, while the other authors have
shown a more varied style in introducing dialogues. Sample instances of the
pattern in the novel:
• , dit Grandet en
• , reprit Charles en
• , dit Cruchot en
Pattern (5): , support= 54, is a pattern used to
refer to money, which is typical of the novel’s plot, in which money plays a
very important role. Sample instances of the pattern in the novel:
• vingt mille francs
• deux mille louis
• sept mille livres
Pattern (6) : , support= 59, is used to
express negative questions :
• n' avait -il pas
• ne disait -on pas
• ne serait -il pas
Pattern (7) : , support= 44, represents
the punctuation extensively used to mimic spoken intonation and even to
reproduce performance phenomena such as stuttering:
• , messieurs , cria
• , madame , répondit
• , mademoiselle , disait
The few examples analyzed indicate that the presented technique is effective
in extracting interesting syntactic patterns from a single text, which seems
particularly promising for the analysis of such classic literary texts.
On the other hand, this technique, like other similar ones, raises the
question of what is really captured by significant patterns. Some structures
may be significant because they are typical of an author’s style (the
author’s fingerprint, to borrow a metaphor often used in attribution
studies), or they may be dictated by functional needs arising from the
particular topic of the novel or from the conventions of the chosen genre.
This is particularly true for syntactic analysis, where the functional
constraints on authorial freedom are more evident. Much further work is
needed on this issue.
4 Conclusion
In this paper, we have presented an objective interestingness measure to ex-
tract meaningful stylistic syntactic patterns from a given author’s work. Our
hypothesis is based on the fact that the most relevant syntactic patterns
should significantly reflect the author’s stylistic choice and thus they should
exhibit some kind of peculiar overrepresentation behavior controlled by the
author’s purpose. To evaluate the effectiveness of the proposed method, we
conducted an experiment on a corpus of classic French novels. The results
show that the method is effective at extracting interesting syntactic
patterns from this type of text.
Based on the current study, we have identified several future research
directions, such as exploring other statistical measures to assess the
interestingness of a given syntactic pattern and expanding the analysis to
include morpho-syntactic patterns (word forms and lemmas). Finally, we intend
to experiment with other languages and text sizes using standard corpora
employed in the field of computational stylistics at large.
References
Biber, D., 2006. University language: A corpus-based study of spoken and written
registers, John Benjamins Publishing.
Biber, D. & Conrad, S., 2009. Register, genre, and style, Cambridge University Press.
Chandola, V., Banerjee, A. & Kumar, V., 2009. Anomaly detection: A survey. ACM
Computing Surveys (CSUR), 41(3), p.15.
Craig, H., 2004. Stylistic analysis and authorship studies. A companion to digital
humanities, 3, pp.233–334.
Frontini, F., Boukhaled, M.A. & Ganascia, J.-G., 2014. Linguistic Pattern
Extraction and Analysis for Classic French Plays.
Mahlberg, M., 2012. Corpus stylistics and Dickens’s fiction, Routledge.
Mangiapane, S., 2012. Ponctuation et mise en page dans Madame Bovary: les
interventions de Flaubert sur le manuscrit du copiste. Flaubert. Revue critique et
génétique, (8).
Quiniou, S. et al., 2012. What about sequential data mining techniques to identify
linguistic patterns for stylistics? In Computational Linguistics and Intelligent
Text Processing. Springer, pp. 166–177.
Ramsay, S., 2011. Reading machines: Toward an algorithmic criticism, University of
Illinois Press.
Schmid, H., 1994. Probabilistic part-of-speech tagging using decision trees. In
Proceedings of the international conference on new methods in language
processing. pp. 44–49.
Siemens, R. & Schreibman, S., 2013. A companion to digital literary studies, John
Wiley & Sons.
Viger, P.F. et al., 2014. SPMF: A Java Open-Source Pattern Mining Library. Journal
of Machine Learning Research, 15, pp.3389–3393.