<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A multitude of linguistically-rich features for authorship attribution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludovic Tanguy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Assaf Urieli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Basilio Calderone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nabil Hathout</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Franck Sajous</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLLE-ERSS: CNRS &amp; University of Toulouse firstname.lastname@univ-tlse2.fr</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports on the procedure and learning models we adopted for the 'PAN 2011 Author Identification' challenge targeting real-world email messages. The novelty of our approach lies in a design which combines shallow characteristics of the emails (word and trigram frequencies) with a large number of ad hoc linguistically-rich features addressing different language levels. For the author attribution tasks, all these features were used to train a maximum entropy model, which gave very good results. For the single-author verification tasks, a set of features exclusively based on the linguistic description of the emails' messages was considered as input for symbolic learning techniques (rules and decision trees), and gave weak results. This paper presents in detail the features extracted from the corpus, the learning models and the results obtained.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and motivations</title>
      <p>
        Given a set of texts (in our case email messages) and their corresponding authors, the basic idea
underlying Authorship Attribution (AA) systems is that various computable textual features may
be relevant enough to capture statistically-based generalizations of the 'stylistic distinctiveness' of
the writers, in order to distinguish texts written by different authors. These 'textual features' thus
refer to the ability to represent a text by a set of attributes that capture a particular author's
'writing style' [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        As members of the NLP team of a linguistics research laboratory, we decided to concentrate
on computing a large set of features, as we saw in this challenge an opportunity to take advantage
of our experience in a wide variety of automated text annotation tasks. As will be presented in
detail in § 2, we addressed both traditional features used for AA [
        <xref ref-type="bibr" rid="ref10 ref18">10, 18</xref>
        ], as well as more innovative ones
targeting more focused textual phenomena.
      </p>
      <p>Our definition and selection of these features was based on an approach that originates in the
linguistic side of NLP and computational linguistics. More precisely, the first phase of our work was
to carefully examine the data, using a combination of human intuition and computer-aided
investigation tools, a method common in areas such as corpus linguistics.</p>
      <p>The next step was to design (in most cases, to recycle) specific processing methods to compute
the selected features. Team work, available linguistic resources and the variety of our experience in
different areas of computational linguistics were crucial assets here. As described in the following
sections, we were thus able to combine morphological, syntactic and semantic techniques, and
also to design many data-specific features. While some of the features rely on well-known techniques,
we also designed innovative approaches, such as the different measures used to assess syntactic
complexity and semantic cohesion.</p>
      <p>The last step was to feed all these features into a machine learning algorithm (§ 3). Given
the sheer number of features (especially when word and trigram frequencies are involved), our
choice was to use a maximum entropy learner, which is able to cope with such quantities and with
a variety of feature types (both quantitative and qualitative). However, we also wanted to have
some kind of feedback from this procedure, and to identify which features were the most efficient
for the task. This led us to apply other machine learning methods, namely tree- and rule-based
algorithms, whose main advantage is to give intelligible representations of the learned schemes. As
far as the task results are concerned, this second choice was not a wise move, as we will discuss in
the last section.</p>
      <p>The PAN 2011 Authorship Attribution competition was based on real-world email messages
(extracted from the Enron corpus) and consisted of two different sets of tasks. The first
(standard authorship attribution) required the participants to infer the author of each
message in the test data (from among all the authors present in the training data). Some of the test sets
also contained emails written by unknown authors (absent from the training data): these subtasks'
names are identified with a + sign. Training sets contained 3,000 messages from 26 authors (small
data set) and 10,000 messages from 72 authors (large data set), while test sets ranged from 400 to
1,500 messages. The second set of tasks (authorship verification) focused on a single target author,
requiring the participant simply to decide, for each message in the test data, whether it was written by
this particular author or not. Three different authors were targeted: for each author, the training
data contained about 50 messages, and the test data about 100, with, of course, a totally unknown
distribution.</p>
      <p>The data itself was very challenging, especially when compared to other data sets used for this
kind of task (mostly literary texts). In addition to the (very) sloppy writing inherent to email
messages, the data was very heterogeneous, as the authors used email in very different contexts
(personal as well as formal and informal professional communication). Some emails even contained
non-English text, raw computer data and automatically generated content.</p>
    </sec>
    <sec id="sec-2">
      <title>Linguistic features</title>
      <p>In this section we give a detailed list of all the features used in this experiment. We chose to
group them according to the linguistic units they address. We identified four such levels: sub-word
(morphological units, character trigrams), word, sentence or phrase (syntax) and message.</p>
      <sec id="sec-2-1">
        <title>Preprocessing</title>
        <p>As many of the features used in this experiment required linguistic information from different
language levels, our first step was to apply NLP tools to the raw text messages provided for this
competition.</p>
        <p>
          We chose Stanford CoreNLP [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to do most of the work. This suite of tools is freely available
and ready-to-run for English, and it addresses all the basic levels of natural language processing,
mainly:
- tokenization: identification of word and sentence boundaries, according to punctuation marks;
- POS tagging: identification of each word's part-of-speech (POS) category (Noun, Verb, etc.),
along with a number of inflectional features (number, tense, etc.);
- lemmatization: identification of each word's lemma, or citation form;
- syntactic parsing: identification of the syntactic relationships between words in a sentence.
        </p>
        <p>The Stanford Parser is a dependency analyzer, and provides tagged pairwise links between
syntactically related words (subject, object, determiner, etc.);
- named entity recognition: identification and classification of expressions that refer to a person,
organization, date, quantity or location.</p>
        <p>To avoid tweaking the main parameters of these programs, we applied a small number of
modifications to the data. We replaced all &lt;NAME/&gt; XML tags (resulting from the anonymization
process) with an arbitrary string, as CoreNLP could not cope with such input. We also detected
(and replaced with a short arbitrary string) the message parts which contained raw (non-text)
data, and would have led the most sophisticated processes (mostly the tagger and parser) to crash
due to the lack of sentence delimiters. Finally, we limited the size of sentences to be processed
to 50 words, discarding longer sentences which would have taken too long to parse and would
probably not have led to interesting fine-grained linguistic features.</p>
        <p>The features used for our experiment were computed on the output of these processes, with
a few exceptions that are specifically indicated. The following sections give the complete list,
according to the linguistic level to which they belong.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Sub-word-level features</title>
        <p>The first level addresses linguistic characteristics that appear inside words themselves. Most of
these do not need any kind of linguistic preprocessing, and were in fact computed directly on the
raw message text.</p>
        <p>
          Character trigrams. Language models based on character n-grams have long been used for
various text mining tasks, such as language identification [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Regarding authorship attribution,
they offer many advantages: they have proven to be effective [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and, additionally, they are language-independent
and do not require any preprocessing (not even tokenization). Merely using character
n-grams may partly capture some of the features described below: suffixes, punctuation, case,
contractions, smileys, function words such as prepositions, pronouns and conjunctions, as well as
some cases of alternative spelling (e.g. American vs British English, such as -or/-our, -ise/-ize).
        </p>
        <p>The character trigrams occurring in the larger training set were extracted, and the 10,000 most
frequent ones were retained as features.</p>
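        <p>As a rough illustration, the trigram extraction and per-message relative frequencies described above can be sketched as follows (a minimal sketch; the function names are ours, not those of the actual pipeline):</p>

```python
from collections import Counter

def char_trigrams(text):
    # Overlapping character trigrams, computed on the raw text
    # (no tokenization or other preprocessing needed).
    return [text[i:i + 3] for i in range(len(text) - 2)]

def top_trigram_features(corpus, k=10000):
    # Keep the k most frequent trigrams over the whole corpus as the feature set.
    counts = Counter()
    for message in corpus:
        counts.update(char_trigrams(message))
    return [tri for tri, _ in counts.most_common(k)]

def trigram_vector(message, feature_set):
    # Relative frequency of each retained trigram in one message.
    counts = Counter(char_trigrams(message))
    total = sum(counts.values()) or 1
    return [counts[tri] / total for tri in feature_set]
```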
        <p>
          Suffixes. Features related to the morphological properties of the words can be computed both at
the sub-word level and at the word level. In both cases, we used the CELEX database [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a
resource which provides morphological analyses for a significant fragment of the English lexicon. The
morphological features computed at the sub-word level concern the suffixes of derived words in
the CELEX parses. These features were computed directly on the lemmas. For instance, one such
feature is the number of words in the message whose lemmas end in -ion. We only considered
suffixation because suffixed words are the most frequent derivatives and suffixes are better predictors
of word formation than prefixes.
        </p>
        <p>More precisely, we collected from CELEX the 149 suffixes which occur at the end of at least
one suffixed word. Then, for each message and each suffix, we counted the number of words which
end with this suffix. Two additional features were also computed for each message: the ratio of
suffixed words to total words and their overall number.</p>
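        <p>A minimal sketch of this counting, with a toy suffix list standing in for the 149 CELEX suffixes:</p>

```python
def suffix_features(lemmas, suffixes):
    # Per-suffix counts of lemmas ending with that suffix, plus the
    # overall number of suffixed words and their ratio to total words.
    counts = {s: sum(1 for l in lemmas if l.endswith(s)) for s in suffixes}
    suffixed = sum(1 for l in lemmas if any(l.endswith(s) for s in suffixes))
    ratio = suffixed / len(lemmas) if lemmas else 0.0
    return counts, suffixed, ratio
```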
        <p>Punctuation. The rate of each regular punctuation mark, of multiple marks such as !! and ???,
and of space misuse around punctuation was computed. Indeed, some authors are more prone to
neglect the use of punctuation marks in emails than in formal writing, or do not have an in-depth
knowledge of typographic rules. Recurrent misuse of spaces may reveal a non-native speaker: e.g.
a French author may repeatedly leave a blank before a colon when writing in English. A list of
34 features corresponding to regular punctuation marks and misuse patterns was thus compiled.</p>
        <p>Smileys. A list of possible smileys was compiled, and the 6 appearing in the test sets were selected.
For each of these specific smileys, a binary feature indicated the presence or absence of this smiley
in a given message. The use of smileys was expected to be more effective in capturing types of
messages (formal/informal) than authors' style. However, the fact that some authors use
specific smileys, or no smiley at all, may be a hint of authorship.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Word-level features</title>
        <p>These features deal with formal characteristics of word units. They take advantage of the
tokenization performed by CoreNLP (and therefore inherit its errors). For most features,
lemmatization and POS tagging results were used, and for some of them we had to build specific lists of
words or patterns, which we have made available at the following URL:
http://redac.univ-tlse2.fr/projects/authorshipidentification/.</p>
        <p>Word frequencies. Following traditional approaches to text categorization, we computed
the relative frequency of each wordform encountered in the messages. As will be detailed in § 3,
this set of features is discarded in some parts of our experiments to focus on linguistically
richer features.</p>
        <p>Case. The proportion of all-capitals tokens and of tokens starting with an upper-case letter was computed.
Indeed, some authors tend to omit initial capital letters after sentence endings. Some writers have
a tendency to "shout", i.e. write a whole sentence, or a part of it, in all caps. Two specific features
were dedicated to the forms i/I, as some authors use the lower case in emails instead of the
regular upper case.</p>
        <p>Morphologically complex words. Morphological features were also computed at the word level,
once again with the CELEX database. We compiled the 30,693 morphologically complex words
from CELEX, leaving out the ones formed by conversion (e.g. the noun hand converted into the
verb hand). These words include suffixed words (e.g. governable), prefixed words (e.g. anti-aircraft)
and compounds (e.g. handbook). Two features were computed for each message: the ratio of
morphologically complex words to total words and their overall number.</p>
        <p>Word length. Messages can also be characterized by the length of their words. For instance, the
presence or absence of long words may indicate a message's technicality. Word length is also a
rough estimate of the morphological complexity of the words: longer words tend to be
more complex. For each message, we counted the number of words of each length and we also
computed the mean word length.</p>
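        <p>These counts reduce to a length histogram and a mean, as in this minimal sketch:</p>

```python
from collections import Counter

def word_length_features(tokens):
    # Number of words of each length, plus the mean word length.
    lengths = Counter(len(t) for t in tokens)
    mean = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    return dict(lengths), mean
```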
        <p>Inflections and stopwords. The inflection rate (ratio of inflected forms to lemmas) was
computed and resulted in 3 features (for nouns, verbs and adjectives). The stopword rate was used
as an additional feature.</p>
        <p>
          Spelling errors. Spelling errors have long been identified as related to a particular author's
writing specificities [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. We compared each lowercase word in a message to a reference wordlist
(from the wamerican Debian package), and applied a brute-force approach to check for specific
spelling errors. More precisely, we tested for the following errors (in order of priority): excess
repeated letter, missing repeated letter, letter inversion, excess letter, missing letter, modified
letter. In addition, in cases where none of the previous tests led to a known word, we tested
for word collision through automated querying of the Bing Web search engine (which, when given
"officetoday" as input, proposes to correct it to "office today").
        </p>
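        <p>The brute-force check can be sketched as follows; the priority order matches the list above, a toy lexicon stands in for the wamerican wordlist, and the search-engine collision test is omitted:</p>

```python
import string

def spot_error(word, lexicon):
    # Return the first error type (in priority order) whose one-edit
    # repair yields a known word, else None.
    # Excess repeated letter: drop one of a doubled letter.
    for i in range(len(word) - 1):
        if word[i] == word[i + 1] and word[:i] + word[i + 1:] in lexicon:
            return "excess_repeated"
    # Missing repeated letter: double an existing letter.
    for i in range(len(word)):
        if word[:i + 1] + word[i:] in lexicon:
            return "missing_repeated"
    # Letter inversion: swap adjacent letters.
    for i in range(len(word) - 1):
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        if swapped != word and swapped in lexicon:
            return "inversion"
    # Excess letter: delete any letter.
    for i in range(len(word)):
        if word[:i] + word[i + 1:] in lexicon:
            return "excess_letter"
    # Missing letter: insert any letter at any position.
    for i in range(len(word) + 1):
        for c in string.ascii_lowercase:
            if word[:i] + c + word[i:] in lexicon:
                return "missing_letter"
    # Modified letter: substitute any letter.
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            if c != word[i] and word[:i] + c + word[i + 1:] in lexicon:
                return "modified_letter"
    return None
```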
        <p>The resulting features give the overall rate of misspelled words in a message, as well as specific
rates for each kind of spotted error. While at first we considered letter-specific errors (e.g. repeated
inversion of i and o), the lack of variety over the corpus led us to limit our features to broad
categories.</p>
        <p>Contractions and abbreviations. A list of contractions and their corresponding full forms was
compiled manually and resulted in about 200 features. These forms were then divided into the
following classes: positive contractions with apostrophe (we're), negative contractions with
apostrophe (isn't), positive full forms (we are), negative ones (is not), standard and nonstandard
"collapsed" forms (cannot, gimme), "improper collapsed" forms (arent, thats) and a class of words
used specifically in spoken language (gonna, kinda, nope, ya, etc.). These classes resulted in 7
additional features.</p>
        <p>A list of 450 abbreviations was compiled manually. It contained usual abbreviations such as
ASAP or wrt, some taken from SMS language (e.g. 2L8 for too late) and others originating
from chat rooms (e.g. BBIAB for be back in a bit). Each abbreviation resulted in an individual
feature.</p>
        <p>US/UK variants. Two lists of roughly 550 American words and 550 British ones were
manually compiled, using lexicons available from the web. These lists include alternative words
(vacation/holiday, zip code/postcode) and spellings (color/colour, vs./vs). The vast majority of the messages
were expected to be written by American authors. However, for some authors, the use of American
and British words may overlap (e.g. a message in the test set contains the British holiday and the
American bill and gotten).</p>
        <p>Binary features denote the presence/absence of the 70 British and 170 American forms, taken
from the aforementioned lists, that occur in the test sets. Two features account for the total
number of forms (types) from each category, and two features represent the ratio of American
(resp. British) forms to the number of tokens.</p>
        <p>
          WordNet. A total of 12 features were based on WordNet. Four features represent the proportion
of noun (resp. verb, adjective and adverb) lemmas occurring in the messages that are known from
Princeton WordNet [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Four additional features (one per part-of-speech) give the average
number of synsets to which the words known to WordNet belong. These features roughly
represent the degree of polysemy of the words used by the authors. For each message, two features
represent the average depth of the noun and verb synsets in WordNet's conceptual hierarchy. Two
other features denote the average minimal depth for nouns and verbs. These depth-based features
are indicative of semantic specificity [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          Named entities. As presented above, the CoreNLP suite provides a named entity (NE)
recognition and classification tool [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We therefore computed the relative frequency of each NE type in
a message (date, location, money, number, ordinal, organization, percent, person and time).
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Sentence-level features</title>
        <p>
          The next set of features addresses the syntactic level. Syntax can be approached through crude
techniques, or by taking advantage of the syntactic parsing provided by CoreNLP.
N-grams of part-of-speech tags. Bigrams and trigrams of part-of-speech tags (ignoring
inflection indicators) were computed for all messages. The 732 bigrams and the 1,000 most frequent
trigrams (out of a total of 6,598) were retained. Part-of-speech trigrams have been used in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] as an
approximation of syntactic structures, in order to distinguish two groups of speakers (English
and Finnish emigrants to Australia), and adults from children among the Finnish immigrants.
Part-of-speech n-grams are relevant here both because they may reveal differences between native
and non-native speakers, and because they may capture differences in the syntactic patterns used
by native speakers.
Syntactic depth and complexity. The general notion of syntactic complexity can be seen as
an important feature of an author's style. In addition to the simple measure of sentence length (in
number of words), we measured two different parameters for each sentence in a message.
        </p>
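        <p>Given the tag sequences produced by the tagger, the n-gram counting itself is straightforward (a sketch with illustrative names):</p>

```python
from collections import Counter

def pos_ngrams(tag_sequences, n):
    # Count n-grams of POS tags over a message's sentences
    # (inflection indicators assumed already stripped, e.g. NNS -> NN).
    counts = Counter()
    for tags in tag_sequences:
        counts.update(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    return counts
```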
        <p>The first is simply the depth of the syntactic tree derived from the syntactic dependencies
provided by the parser (after minor transformations). Sentences with deep trees have a large
number of intermediary constituents, such as complex noun phrases, subordinates, etc. We measured
both the maximal depth and the average depth for each message.</p>
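        <p>With the dependencies rewritten as one head index per token (0 for the root, as in CoNLL-style output), the depth computation can be sketched as:</p>

```python
def tree_depth(heads):
    # heads[i] is the 1-based index of token i+1's head, 0 for the root.
    # Returns the number of nodes on the longest root-to-leaf path.
    def depth(i):
        d = 1
        while heads[i - 1] != 0:
            i = heads[i - 1]
            d += 1
        return d
    return max(depth(i) for i in range(1, len(heads) + 1))
```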
        <p>
          The second parameter is more directly related to the output of the parser. Following [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
we measured the average and maximal distance (expressed in number of words) covered by the
identified dependency links. The resulting feature is not correlated with the tree depth, but can
capture another dimension of syntactic complexity [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. As an example, a complex subject noun
phrase (e.g. including a subordinate clause) results in an increased distance between the head noun
and the verb.
        </p>
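        <p>The distance measure itself is simply the word-index span of each link, averaged and maximized per message; a minimal sketch:</p>

```python
def link_distances(dependencies):
    # Each dependency is a (governor_index, dependent_index) pair of
    # 1-based word positions; distance is the number of words spanned.
    dists = [abs(gov - dep) for gov, dep in dependencies]
    return sum(dists) / len(dists), max(dists)
```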
        <p>Syntactic dependencies. From the aforementioned syntactic parsing, a list of 2,439 types of
syntactic dependencies was extracted. A dependency consists of the governor's and dependent's
parts-of-speech (leaving out inflection indicators), linked by a given relation. The 1,000 most
frequent dependencies were retained, resulting in 1,000 corresponding features denoting the ratio
between the occurrences of a given dependency in a message and the total number of dependencies.
We added three more general features: the total number of dependencies in a message,
this number divided by the number of tokens, and this number divided by the number of sentences.</p>
      </sec>
      <sec id="sec-2-5">
        <title>Message-level features</title>
        <p>The fourth set of features deals with the message at a global level, and addresses a variety of
linguistic issues. The following features tackle the message's typesetting, discourse structures and
organization, as well as general semantic phenomena such as semantic cohesion.</p>
        <p>Message length and general typesetting. The number of tokens, sentences and lines, and the
proportion of blank lines, were computed for each message.</p>
        <p>Openings and closings. The most frequent opening lines were extracted and manually selected
as potentially relevant. This selection resulted in 22 features including: no opening line, proper
name followed or not by a colon or a comma, greetings starting with hello/hey/hi/dear, etc.</p>
        <p>Potential closing patterns were extracted from the messages and manually selected. First, a
list of non-anonymized signatures (lower-case first names and family names, initials, etc.) resulted
in 72 features. The presence/absence of some closing formulas (thanks/thanx/(all) the best/(best)
regards/take care, etc.) or of (anonymized) names resulted in 10 features. Finally, a feature was
dedicated to the detection of 44 selected closing patterns, consisting of a combination of names,
comma/dot, blank lines, closing formulas, etc.</p>
        <p>Distributional semantic cohesion. This last set of features was computed in order to estimate
the "semantic cohesion" of an email message. The idea is to compute the semantic similarity between
words in a message, so that cohesive messages (dealing with a narrow topic, and thus containing
many semantically related words) can be distinguished from those addressing multiple subjects
(with many unrelated words). This kind of measure can also be seen as an attempt to capture
some of the writing behaviors of authors.</p>
        <p>Measures of semantic similarity between words can be obtained from manually constructed
resources such as WordNet, or from the data produced by distributional semantic analysis. The
distributional hypothesis is that similar words can be identified according to their
occurrence in similar contexts. Thus, through the syntactic processing of a very large quantity of texts,
specific techniques can automatically quantify the semantic similarity between words.</p>
        <p>
          We used Distributional Memory (DM) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], which organizes the distributional information in a
third-order tensor. It was computed on a very large corpus of diversified texts. Specifically,
we used the TypeDM model1 and its word-by-link-word matrix, which results from the tensor's
matricization.
        </p>
        <p>In particular, for each adjective, noun and verb in the matrix, we identified the 150 nearest
semantic neighbors by calculating cosine similarity. The features defining the semantic cohesion
of the message were then obtained by counting the number of neighbors for each pair of lemmas
found in the message. Different neighborhood sizes were considered: from 10 to 150 neighbors (in
steps of 10).</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Machine learning techniques and tools</title>
      <p>The next step in our approach was to apply machine-learning techniques to this huge set of
features. We used two different kinds of techniques, which are presented in this section.</p>
      <sec id="sec-3-1">
        <title>Maximum entropy</title>
        <p>
          For the author attribution tasks, we trained a single maximum entropy model [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] using the
OpenNLP2 MaxEnt library. While it is theoretically possible for a maximum entropy model to
integrate thousands of heterogeneous features, both numeric and nominal, it is critical to normalize
numeric features to avoid a bias in the model towards features with a larger numeric scale.
        </p>
        <p>The features described above were collected as a set of CSV files. To simplify the task of
normalizing, discretizing, training and evaluating based on features spread among dozens of files,
we coded a software module, csvLearner3, distributed as open source and specifically designed for
collaborative machine learning projects.</p>
        <p>
          Feature normalization. For the sake of simplicity, we initially tried to unify the features into
a purely nominal space by applying Fayyad and Irani MDL discretization [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We then attempted
two types of normalization: setting the max value to 1.0, and setting the mean value to 0.5 (the latter
to correctly normalize features with a handful of high outliers). Finally, we divided the features
into several groups, each of which was normalized based on the max value of the entire group,
while leftover features were normalized individually. By normalizing an entire group of interrelated
features together (e.g. POS tag trigram counts), we avoided losing the information contained in their
relative scale (e.g. the relative count of each POS tag trigram for a given message).
        </p>
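        <p>A minimal sketch of the grouped max normalization, with feature vectors as dicts and illustrative names:</p>

```python
def grouped_max_normalize(rows, groups):
    # Divide every feature in a group by the max value observed for the
    # whole group, preserving the relative scale of interrelated features;
    # features outside any group would be normalized individually.
    normalized = [dict(row) for row in rows]
    for group in groups:
        peak = max((row[f] for row in rows for f in group if f in row), default=0.0)
        if peak:
            for row in normalized:
                for f in group:
                    if f in row:
                        row[f] = row[f] / peak
    return normalized
```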
        <p>Results for each of these normalization methods, when applied to SmallTrain using 10-fold
cross-validation, are shown in the table below. Since there is no significant difference between
max and mean normalization, we settled on max grouped normalization.</p>
        <sec id="sec-3-1-1">
          <title>1 http://clic.cimec.unitn.it/dm/#data 2 http://incubator.apache.org/opennlp 3 https://github.com/urieli/csvLearner</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>[Table: 10-fold cross-validation accuracy on SmallTrain for six methods: raw features; discretization;
normalization (max); normalization (mean); grouped normalization (max); grouped normalization (mean).]</p>
        <p>Separating the wheat from the chaff. For tasks including unknown authors, the training sets
were evaluated to find a criterion for separating known and unknown authorship. One of the main
advantages of the maximum entropy machine learning algorithm is that it provides a probability
for each possible outcome. Various parameters were thus considered, including p1 (best-guess
probability), p2 (second-guess probability), p1 - p2, z-score(p1), z-score(p2) and the standard deviation
of the probabilities over all outcomes. Curiously, the parameter spread for wrongly guessed messages and
for messages by unknown authors was strikingly similar, and clearly differentiated from the spread
for correctly guessed messages.</p>
        <p>We finally settled on p1 alone as providing the cleanest and simplest criterion, and selected the
p1 threshold which maximized overall accuracy on SmallValid+ (p1 &lt; 0.66 results in 'unknown')
and LargeValid+ (p1 &lt; 0.4). We also submitted another run which aimed at better precision,
with respective thresholds of 0.95 and 0.75.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Rule-based learning and decision trees</title>
        <p>For the verification tasks (Verify1, Verify2 and Verify3, each dealing with a different single
author), we applied machine learning techniques that allow us to obtain interpretable results, and
thus to identify the linguistic profile of the target author. Another objective was to consider
a decision scheme that does not rely on "knowledge-poor" features (mostly trigram and word
frequencies), but on linguistically enriched features only. This decision was motivated not by the
specificity of the verification subtask, but rather by the smaller size of its data and its more specific
and strict aim.</p>
        <p>We thus switched from the heavy-duty, opaque maximum entropy learner to less sophisticated
learning techniques. The specific algorithm for each set was simply chosen according to
comparative performance on the training data.</p>
        <p>
          For the Verify1 dataset, we adopted a decision tree (C4.5 algorithm [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]) which splits the
training dataset into different subsets according to the most 'informative' features, i.e. the features
that maximize the information gain (the feature yielding the highest information gain score if
used to split the data). For the remaining Verify2 and Verify3 datasets, we used
a rule-generating algorithm ('Repeated Incremental Pruning to Produce Error Reduction', or
RIPPER [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]) that implements an incremental reduced-error pruning method to establish whether a
rule should be pruned or not. Both classification algorithms were optimized, trained and applied
in the WEKA software package4 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>In both cases, the training data consisted of the corresponding provided training sets,
comprising a few dozen messages by the target author, to which we added 1,500 messages
written by a variety of different authors from the SmallTest training set.</p>
        <p>Details. Figure 1 illustrates the decision tree obtained after training on the Verify1 dataset. For better
readability we do not indicate the exact thresholds for each decision node. The first node of the
model corresponds to a specific typographic error (in this case, the absence of a blank after a
question mark, see § 2.2). Other classificatory features are the number of person named entities
(see § 2.3), the distributional semantic neighbors extracted from the Distributional Memory (DM)
(see § 2.5), the maximal depth of the syntactic tree (see § 2.4), the number of past participles as
indicated by the POS parser (see § 2.3) and the apostrophe count for each message (see § 2.5).</p>
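<p>Since Figure 1 cannot be reproduced here, the sketch below only shows the shape such a tree takes once read as nested tests. The first test mirrors the first node reported above; every subsequent branch, threshold and feature name is a placeholder of ours and does not reproduce the trained model.</p>

```python
def verify1_tree_sketch(m):
    """Illustrative shape of a C4.5-style tree for Verify1 (cf. Figure 1).

    `m` maps feature names to per-message scores; T and the branch
    order below the root are invented placeholders, not the model.
    """
    T = 0.5
    if m["qmark_no_blank"] > T:      # typographic error rate (see § 2.2)
        return "Y"
    if m["person_entities"] > T:     # person named entities (see § 2.3)
        return "Y" if m["dm_neighbors"] > T else "N"
    if m["tree_depth"] > T:          # maximal syntactic depth (see § 2.4)
        return "N"
    return "Y" if m["past_participles"] > T else "N"
```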
        <p>Table 1 shows the rules resulting from the RIPPER model for the Verify2 and Verify3
tasks. In addition to some distributional semantic neighbor features, these datasets made use of
a mix of word-level features: the presence of words with specific morphological suffixes (§ 2.3) and
the frequency of a specific POS trigram (§ 2.4).</p>
        <sec id="sec-3-3-1">
          <p>Table 1. RIPPER rules for the Verify2 and Verify3 tasks. (The original two-column layout of
this table was lost in conversion; the rules threshold distributional semantic neighbor scores such as
DM_20neighbors_NORM, together with apostrophe and colon counts, a morphological suffix count
and a POS trigram frequency, and each rule set ends with a default `otherwise N' rule.)</p>
          <p>This tree and these sets of rules gave us positive feedback concerning the usefulness of a variety
of linguistic features. They can also tell us specific information about each target author. For
example, author 1 tends to forget blanks after question marks, and uses a high number of people's
names. However, the distributional semantic scores seem to have the highest discriminative
power, and yet they are not easy to interpret in terms of an author's general writing style.</p>
          <p>Table 2 summarizes the scores obtained with the aforementioned models on the
three verification tasks. The scores concern the positive (target author) class, averaged
over a 10-fold cross-validation. We were initially quite satisfied with these results, which seemed on
a par with those obtained with the maximum entropy method, although on a very different
task.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>The overall results of our different runs are the following: we apparently did very well on the
attribution tasks, but failed on all three verification tasks. Detailed values are reported in Table 3.
For the Large+ and Small+ tasks, it was indeed the low-threshold version that gave the best
results in terms of F-score: with higher probability thresholds, the precision gain was considerable
but did not counterbalance the loss in recall.</p>
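<p>The precision/recall trade-off described above can be made explicit with a small threshold sweep; this is a generic sketch of ours, not the evaluation code used in the challenge.</p>

```python
def f_score_at_threshold(probs, gold, threshold):
    """Precision, recall and F1 for the positive class when a message is
    attributed to the candidate author only if the model's probability
    reaches `threshold`. Raising the threshold trades recall for precision.
    """
    pred = [p >= threshold for p in probs]
    tp = sum(1 for pr, g in zip(pred, gold) if pr and g)
    fp = sum(1 for pr, g in zip(pred, gold) if pr and not g)
    fn = sum(1 for pr, g in zip(pred, gold) if not pr and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```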
      <p>Many things can be learnt from these results.</p>
      <p>The first thing is that the good results we obtained for the attribution tasks may well be related
to the linguistically-rich features we used in addition to the more traditional word/trigram frequencies.
During the training phase, we indeed observed that these features increased the estimated efficiency
of the maximum entropy learner. However, we still have to investigate the amount of information
added by individual features or feature sets. As noted above, this estimation is much more difficult
for this kind of method than for symbolic learning, as the information about the model is only
accessible through a set of weights and requires a detailed statistical analysis. However, we intend
to test a larger number of feature combinations in order to assess their exact contribution to the
final results.</p>
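<p>Testing feature combinations as proposed above amounts to an ablation loop over feature groups; the sketch below is a generic illustration of ours, with the evaluation function left abstract (e.g. a cross-validated F-score).</p>

```python
from itertools import combinations

def ablation_runs(feature_groups, evaluate):
    """Score every non-empty combination of feature groups.

    `evaluate` is any callable mapping a tuple of group names to a
    score; comparing scores with and without a group estimates that
    group's contribution to the final results.
    """
    results = {}
    for k in range(1, len(feature_groups) + 1):
        for combo in combinations(feature_groups, k):
            results[combo] = evaluate(combo)
    return results
```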
      <p>Another good point for our method is the apparent reliability of the probability score provided
by the maximum entropy algorithm: this score proved a good predictor and allowed our method to
rank very high on the unknown-author tasks. This estimation is, in our eyes, a clear advantage of
this learning technique, and can have positive implications in other tasks where maximum
entropy has proven to be efficient (such as parsing).</p>
      <p>The last issue raised is our failure in the verification tasks. It may be linked to some of the
following causes: the abandonment of the poorer features, the switch from maximum entropy to rule-based
techniques, and the fact that author verification is indeed a very different task from attribution.
Each of these hypotheses will now require a specific investigation that should lead us to a better
understanding of the tasks, the applicability of the techniques and the relative advantage of
our higher-level features.</p>
      <p>As we have already noted, our main concern as computational linguists is to evaluate the
effectiveness of more sophisticated language processing methods and features. In too many areas,
linguistically-rich approaches have unfortunately not (yet) proven their efficiency when compared
to heavy statistical methods. We hope that our work will contribute to a better understanding of
how, when, and which linguistic knowledge should be used.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>As noted in the introduction, our success in this competitive task is the result of a team
that extends far beyond the authors of this paper. We therefore sincerely wish to thank the
following people (in alphabetical order) for their insights, feature proposals and encouragement:
Clémentine Adam, Cécile Fabre, Bruno Gaume, Mai Ho-Dac, Anna Kupść, Marion Laignelet,
Fanny Lalleman, François Morlane-Hondère, Marie-Paule Péry-Woodley and Nikola Tulechki.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baayen</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piepenbrock</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulikers</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <source>The CELEX lexical database (release 2)</source>
          . CD-ROM. Linguistic Data Consortium, Philadelphia, Penn. (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Distributional memory: A general framework for corpus-based semantics</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>673</fpage>
          –
          <lpage>721</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cavnar</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trenkle</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>N-gram-based text categorization</article-title>
          .
          <source>In: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval</source>
          . pp.
          <fpage>161</fpage>
          –
          <lpage>175</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          :
          <article-title>Fast effective rule induction</article-title>
          .
          <source>In: Twelfth International Conference on Machine Learning</source>
          . pp.
          <fpage>115</fpage>
          –
          <lpage>123</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irani</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          :
          <article-title>Multi-interval discretization of continuous-valued attributes for classification learning</article-title>
          .
          <source>In: IJCAI</source>
          . pp.
          <fpage>1022</fpage>
          –
          <lpage>1029</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (ed.):
          <source>WordNet: An Electronic Lexical Database</source>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Finkel</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grenager</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Incorporating non-local information into information extraction systems by Gibbs sampling</article-title>
          .
          <source>In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL</source>
          <year>2005</year>
          ). pp.
          <fpage>363</fpage>
          –
          <lpage>370</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          :
          <article-title>The WEKA data mining software: an update</article-title>
          .
          <source>ACM SIGKDD Explorations Newsletter</source>
          <volume>11</volume>
          (
          <issue>1</issue>
          ) (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Heylen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peirsman</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geeraerts</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speelman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Modelling Word Similarity: an Evaluation of Automatic Synonymy Extraction Algorithms</article-title>
          .
          <source>In: Proceedings of the Sixth International Language Resources and Evaluation (LREC'08)</source>
          . Marrakech, Morocco
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          –
          <lpage>334</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Accurate unlexicalized parsing</article-title>
          .
          <source>In: Proceedings of the 41st Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>423</fpage>
          –
          <lpage>430</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Exploiting stylistic idiosyncrasies for authorship attribution</article-title>
          .
          <source>In: IJCAI'03 Workshop on Computational Approaches to Style Analysis and Synthesis</source>
          . pp.
          <fpage>69</fpage>
          –
          <lpage>72</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tanguy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Linguistic features to predict query difficulty</article-title>
          .
          <source>In: Proceedings of the ACM SIGIR workshop on Predicting query difficulty - methods and applications</source>
          . pp.
          <fpage>7</fpage>
          –
          <lpage>10</lpage>
          . Salvador de Bahia, Brazil
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nerbonne</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiersma</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>A measure of aggregate syntactic distance</article-title>
          .
          <source>In: Proceedings of the Workshop on Linguistic Distances</source>
          . pp.
          <fpage>82</fpage>
          –
          <lpage>90</lpage>
          . LD '06, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuurmans</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Language independent authorship attribution using character level language models</article-title>
          .
          <source>In: Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , EACL. pp.
          <fpage>267</fpage>
          –
          <lpage>274</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <source>C4.5: Programs for Machine Learning</source>
          . Morgan Kaufmann Publishers, San Mateo, CA (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ratnaparkhi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Maximum entropy models for natural language ambiguity resolution</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Pennsylvania, Philadelphia, PA, USA (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          –
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Tanguy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tulechki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Sentence Complexity in French: a Corpus-Based Approach</article-title>
          .
          <source>In: Proceedings of IIS (Recent Advances in Intelligent Information Systems )</source>
          . pp.
          <fpage>131</fpage>
          –
          <lpage>145</lpage>
          . Krakow, Poland
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>A framework for authorship identification of online messages: Writing-style features and classification techniques</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>57</volume>
          (
          <issue>3</issue>
          ),
          <fpage>378</fpage>
          –
          <lpage>393</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>