<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Paragraph Clustering for Intrinsic Plagiarism Detection using a Stylistic Vector-Space Model with Extrinsic Features</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Toronto</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <abstract>
        <p>Our approach to the task of intrinsic plagiarism detection uses a vectorspace model which eschews surface features in favor of richer extrinsic features, including those based on latent semantic analysis in a larger external corpus. We posit that the popularity and success of surface n-gram features is mostly due to the topic-biased nature of current artificial evaluations, a problem which unfortunately extends to the present PAN evaluation. One interesting of aspect of our approach is our way of dealing with small, imbalanced span sizes; we improved performance considerably in our development evaluation by countering these effect using the expected difference of sums of random variables.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The task of intrinsic plagiarism detection involves distinguishing portions of a single
text which are written by different authors [24]. Key characteristics of the task are the
lack of texts written purely by one author or another, as is typically the case in
authorship attribution [23], and lack of a database of texts from which the texts are formed,
which is the focus of extrinsic plagiarism detection [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. As such, it has more in
common with (to the point of being arguably synonymous with) the task of stylistic
inconsistency detection [
        <xref ref-type="bibr" rid="ref14 ref15 ref18">14,15,18</xref>
        ], and our approach in the task is strongly influenced by this
work.
      </p>
      <p>
        Relatively successful approaches to intrinsic plagiarism detection [
        <xref ref-type="bibr" rid="ref17 ref21 ref22">22,21,17</xref>
        ] have
often relied exclusively on variation in word or character n-gram frequency as the key
indicator of stylistic variation, an approach that is clearly effective in the general task
of authorship attribution [23]. However, in the context of spans as small as paragraphs,
we are somewhat skeptical that these sorts of features capture anything much beyond
the topic shifts which are a common artifact of the usually artificially-created test sets.
In fact, in the context of another stylistic text classification task, native language
identification [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], we found that the effectiveness of character n-grams as stylistic features
seemed to derive largely from the confounding effects of topic in the corpus [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; when
topic was (partially) controlled for, performance of these features plummeted by over
30%. In the case of real-world plagiarism, we would expect that differences in style, not
topic, would be the key indicator of plagiarism, and, although focusing on surface
(intrinsic) features may provide superficial improvement in artificial settings, we think it is
important to branch out and incorporate stylistic information that reflects underlying
dimensions of stylistic variation; a similar approach applied to various text classification
tasks has shown promise [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In recent work, stylistic segmentation of a real stylistically
diverse document (a long poem) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we compared the typical surface n-gram features
to richer extrinsic features, and found that the n-gram surface features, though
reasonably useful on their own, did not seem to combine well with more-targeted features,
and we ultimately discarded them. Therefore, in the present work, which is in most
other respects a reasonably straightforward clustering approach based on maximizing
vector distance between author spans, we entirely eschew n-gram features in favor of
the linguistically motivated extrinsic features that we applied to poetry segmentation.1
In addition, we use a novel approach based on modeling expected random differences
to attenuate the effects of variation in span length.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Feature Selection and Extraction</title>
      <p>The set of features that we explore for this task falls roughly into two categories:
surface and extrinsic. The distinction is not entirely clear cut, but we wish to distinguish
features that use the basic properties of the words or their PoS, which have
traditionally been the focus of automated stylistic analysis, from features which rely heavily
on external lexical information such as word sentiment and, in particular, vector-space
representations, which are more novel for this task.
2.1</p>
      <sec id="sec-2-1">
        <title>Surface Features</title>
        <p>
          Word length Word length is a common textual statistic in register and readability
studies. Readability, in turn, has been used for plagiarism detection [24], and related metrics
were consistently among the best performing features for Guthrie [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
Syllable count Syllable count is reasonably good predictor of the difficulty of a
vocabulary, and is used in some readability metrics.
        </p>
        <p>Punctuation frequency The presence or absence of punctuation such as commas, colons,
semicolons, and quotes can be very good indicator of style. We also include periods,
which offer a measure of sentence length.</p>
        <p>
          Parts of speech Lexical categories can indicate, for instance, the degree of
nominalization, which is a key stylistic variable [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We collect statistics for the four main lexical
categories (noun, verb, adjective, adverb) as well as for prepositions, determiners, and
proper nouns.
1 Nonetheless, we believe that vocabulary-based features do have a role to play in intrinsic
plagiarism detection, albeit a restricted one. Differences in (near-)synonym choices may signal
plagiarism; the student writes rain and snow, but plagiarizes some text that instead says
precipitation. But incorporating this feature into an intrinsic plagiarism detection system would
require it to be appropriately circumscribed in order to prevent confounds of topic. We did not
implement this feature here, as it did not seem applicable to the kind of data used in the PAN
task.
        </p>
        <p>
          Pronouns We count the frequency of first-, second-, and third-person pronouns, which
can indicate the interactiveness and narrative character of a text [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Verb tense Past tense is often preferred in narratives, whereas present tense can give a
sense of immediacy.</p>
        <p>Type-token ratio A standard measure of lexical diversity.</p>
        <p>Lexical density Lexical density is the ratio of the count of tokens of the four substantive
parts of speech to the count of all tokens.</p>
        <p>
          Contextuality measure The contextuality measure of Heylighen and Dewaele [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is
based on PoS tags (e.g. nouns decrease contextuality, while verbs increase it), and has
been used to distinguish formality in collaboratively built encyclopedias [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
2.2
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Extrinsic features</title>
        <p>For those lexicons which include only lemmatized forms, the words are lemmatized
before their values are retrieved.</p>
        <p>
          Presence in Dale-Chall Word List A list of 3000 basic words that is used in the
DaleChall Readability metric [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>
          Unigram count in 1T Corpus Another metric of whether a word is commonly used. We
use the unigram counts in the 1T 5-gram Corpus [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Here and below, if a word is not
included, it is given a zero.
        </p>
        <p>
          Sentiment polarity The positive or negative stance of a span could be viewed as a
stylistic variable. We use a hand-built lexicon for the SO-CAL sentiment analysis system,
which has shown superior performance in lexicon-based sentiment analysis [25].
Sentiment extremity Some sentiment lexicons provide a measure of the degree to which
a word is positive or negative. Instead of summing sentiment scores, we sum their
absolute values, to get a measure of how extreme (subjective) the span is. In our previous
work[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we found that an high coverage automatic resource, SentiWordNet (SWN) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
was preferable to a low-coverage manual resource, and so we use that resource instead
of the manual resource used for the polarity feature.
        </p>
        <p>
          Formality Formality score uses a lexicon of formality that we created in previous work
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The values vary between 1 and +1, with negative values indicating informal,
colloquial words (e.g. damn), and positive values indicating formal terms (e.g.
therefore). The lexicon was built by comparing, using cosine similarity, the latent semantic
analysis (LSA) [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] vectors derived from a binary word-document matrix built from a
filtered version of ICWSM Spinn3r weblog corpus [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], similar to the sentiment
lexicon creation of Turney and Littman [26]. For stylistic concerns, binary matrices appear
to be preferable to the tf-idf weighted matrices that are typically used for topic variation
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          LSA vector features In Brooke et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we posited that, in highly diverse register/genre
corpora, the lowest dimensions of word vectors derived using LSA (or other
dimensionality reduction techniques) often reflect stylistic concerns; we found that using the
first 20 dimensions to build our formality lexicon provided the best results in a
nearsynonym evaluation. Early work by Biber [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] in the Brown Corpus using a related
technique (factor analysis) resulted in the discovery of several identifiable dimensions
of register. In the poem segmentation of Brooke et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we investigated using these
LSA-derived vectors directly, with each of the first 20 dimensions corresponding to
a separate feature, and found that, taken as a whole, they were superior to any other
feature. We use the same word-document matrix, from the ICWSM blog corpus, as in
formality creation. One technical note: the length of the LSA vector depends greatly on
the frequency of the word, which would mean that common words would essentially
drown out rarer words; so before considering each LSA dimension as an individual
feature, we first normalize the entire vector to the unit circle.
        </p>
        <p>
          Normalization For all the features, we also normalize at the level of the word, and
then sum across the span; for instance, to calculate a noun frequency metric, we first
assign each noun in the text a 1, and all other tokens a 0, normalize this distribution to a
mean of 0 and a standard deviation of 1, and then average the normalized values across
the span that we’re interested in. When applying unsupervised techniques with features
of vastly different ranges, normalization is a very important step [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], and calculating
everything at the word level gives the system the flexibility to easily consider many
different possible spans.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Clustering</title>
      <p>
        Our general approach to both of the paragraph-clustering subtasks of the intrinsic
plagiarism detection task is to assign paragraphs into author groups that maximize the
(average) distance between authors. Following Guthrie’s work in stylistic outlier
detection [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and our own previous conclusions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we use L1 or city block distance as
the distance metric. Another important insight of Guthrie that is it is desirable to use
spans as large as possible, i.e. we consider the distance between the spans suspected to
be written by a single author, rather than the distances between individual paragraphs
(e.g. a graph-based approach). In particular, for the single-intruder task, we considered
all possible start/end pairs for second author intrusion, and calculated the difference
between the main and intruder spans, choosing the pair that produced a maximal distance.
For the multi-author task, we began by assigning all paragraphs to a single author and
none to the three other authors. We then iteratively moved spans from one author group
to another, each step being the one that provided the maximum increase in average
distance, until no further improvement was possible.
      </p>
      <p>However, there is a serious flaw in this kind of approach: all other things being
equal, shorter spans have more random variation and thus are, on average, more distant
(sometimes much more distant) from any given span than a longer, more homogeneous
span. Fortunately, this effect can be modeled. We did this by calculating the expected
distance of sums of random variables. Supposing a span of some basic length (we used
50 tokens) to have a random component of normal distribution (with a mean and
standard deviation of 1), we can estimate the expected influence of this randomness on the
distance measure between any pair of spans — for instance, spans of length 400 and
100 — by looking at the sum of random variables corresponding to the n basic spans
that make it up — in this example, the expected difference between the sum of 8
random variables and the sum of 2 random variables with the same distribution. We ran
100 trials using a random number generator and computed a table of such expected
differences, and then divided our calculated distance by the corresponding number in the
table to get a new distance that takes expected difference into account.2
4</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>Even before we reached the final version described above, our approach had perfect
performance on the two example texts provided by the PAN organizers, so we created
some additional corpora for testing, collecting a few different types of texts (early
modern novels, translated Russian novels, and political treatises) from Project Gutenberg,
and automatically creating mixed texts of various difficulty. Here, we present results
using two relatively easy corpora which consist of texts of paragraphs randomly pulled
from novels by Fyodor Dostoyevsky (in English translation) and Thomas Hardy, and
political texts by Thomas Paine and Jean-Jacques Rousseau. For each text in the corpus
for the mixed-author task, we first choose the number of authors (between 2 and 4),
then randomly selected the authors, the number of paragraphs (between 10 and 30) and
then the paragraphs themselves from random locations in the text. For the two-author
insertion task, we randomly choose two authors, a total number of paragraphs, and two
non-equal indices within that range; for each author, a random starting location was
randomly selected and consecutive paragraphs from the first author were randomly
selected for paragraphs before the first index, and then after the second, and consecutive
paragraphs from the second author were inserted between the two indices. Both corpora
have 30 texts created in this fashion.</p>
      <p>
        There are various metrics for extrinsic cluster evaluation; Amigó et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] review
various options and select the BCubed precision and recall metrics [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as having all of a
set of key desirable properties. BCubed precision is a calculation of the fraction of item
pairs in the same cluster which are also in the same category, whereas BCubed recall is
the fraction of item pairs in the same category which are also in the same cluster. The
harmonic mean of these two metrics is BCubed F-score, which served as our metric for
development.
      </p>
      <p>We compare our algorithms with task-specific random baselines (with 50 trials)
and two related alternatives: one which excludes our expected difference corrector and
another that is based purely on the maximizing distances between individual paragraphs
in the spans, rather than treating each cluster as a whole. The results for each of the two
tasks are in Table 1.</p>
      <p>There is little doubt that our expected difference adjustment has an overall positive
effect, and, in the multi-author task, this provides the best result by a reasonably large
2 There may be a closed-form solution to this problem, but in our case it was easier to derive it
empirically.
margin. For the insertion task, which has a much higher random baseline, the individual
span distance comparison was found to be roughly equivalent to our combined span
approach with the adjustment.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>Our linguistically motivated, vector-space clustering approach shows promise,
particularly with our expected difference adjustment. There is, however, obviously more work
to do in this regard; for instance, using this adjustment our multi-author method never,
in practice, predicts more than two authors, probably because the differences between
short spans are now being underestimated rather than overestimated, meaning that two
relatively short author spans (e.g. 3rd and 4th authors) are now highly dispreferred
under our distance-maximizing algorithm. This may partially explain our relatively poor
performance on the multi-author intrinsic plagiarism task, but in fact there is a more
obvious reason. For instance, here are two paragraphs from different authors in the
multi-author task evaluation data (text1):</p>
      <p>John did not dream about the deli. He had nightmares of Douglas falling onto
swords of knights on horseback, and woke several times throughout the night
sweating and breathing heavily.</p>
      <p>But in an empty house, surrounded by evidence of Caroline’s long absence,
Hillie’s words plagued him, and he was forced to accept that his mind might be
capable of the cruelest of tricks. He felt desperately, hopelessly alone.
Stylistically, we find the two authors nearly indistinguishable. There are small
differences (the second author prefers longer sentences and hyperbole), but the easiest way,
for either human or computer, to identify the two is by the names of the characters. All
but two of the paragraphs contains a proper name that appears in several other
paragraphs (one author talks about Geoff, Hillie, and Caroline, the other about Douglas,
John, and Mrs. Cumberland). Beyond proper names, there are also recurring topics:
mail in one story, a job at a deli in another.3 Any model that uses word or character
3 There are even two sets of repeated paragraphs in this particular text: paragraphs 14 and 21 are
the same, as are paragraphs 24 and 30!
n-grams should be able to take easy advantage of these regularities. Our model,
conversely, was specifically designed not to do so; rather, it was developed to detect
significant stylistic differences. In fact, in the task evaluation data there are quite clearly more
obvious stylistic differences between some excerpts from the same novel than between
some excerpts from different novels:</p>
      <p>On his departure, Hillie had pressed her business card into his hand. “My
number’s on there,” she told him. “Call me, all right? I want you to promise.” “I’m
sorry,” Geoff said. “I don’t see the point.” “You’ve suffered a shock. You can’t
be expected to cope at home on your own.” Geoff had simply smiled at her. “I
won’t be on my own,” he said. “I keep telling you. Caroline will look after me.”
Because of the presence of dialogue, this passage is radically different, stylistically,
from the other passage from the same novel that was shown above. However, it is clear
which of the two novels it comes from, since there are several proper-noun indicators.
Given the range of subgenres within the novel genre, i.e. narration, description, and
dialogue, this genre is a particularly bad choice for the purposes of simulating intrinsic
plagiarism, since those stylistic features which exist and might be useful for
distinguishing authors will be ultimately be drowned out by this confounding variation. Instead,
topic-related features, which would be highly unreliable in the real world (for reasons
that are obvious; what is the purpose of a student plagiarizing something that is topically
distinct from the matrix text in which it is embedded?), are strongly preferred.</p>
      <p>Thus, we argue that for the PAN evaluation to be a useful reflection of the
realworld task of intrinsic plagiarism detection, it requires data that are well-controlled for
topic. This might include changing proper nouns (and other highly topical elements)
in the data so that they are matched across authors. Insofar as clustering by authorship
(as opposed to merely detecting intrusions by one or more authors into a matrix text)
is taken as an interesting research problem that mirrors aspects of intrinsic plagiarism
detection, the evaluation needs to be constructed so that the focal task is not confounded
by orthogonal issues such as subgenre detection.
23. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the
American Society for Information Science and Technology 60(3), 538–556 (2009),
http://dx.doi.org/10.1002/asi.21001
24. Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Language Resources and</p>
      <p>Evaluation 45(1), 63–82 (2011)
25. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for
sentiment analysis. Computational Linguistics 37(2), 267–307 (2011)
26. Turney, P., Littman, M.: Measuring praise and criticism: Inference of semantic orientation
from association. ACM Transactions on Information Systems 21, 315–346 (2003)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amigó</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information Retreval</source>
          <volume>12</volume>
          ,
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          (
          <year>August 2009</year>
          ), http://dl.acm.org/citation.cfm?id=
          <volume>1555682</volume>
          .
          <fpage>1555686</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whitelaw</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chase</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hota</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levitan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Stylistic text classification using functional lexical features</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>58</volume>
          ,
          <fpage>802</fpage>
          -
          <lpage>822</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baccianella</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esuli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining</article-title>
          .
          <source>In: Proceedings of the 7th conference on Language Resources and Evaluation (LREC'10)</source>
          . Valletta,
          <string-name>
            <surname>Malta</surname>
          </string-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bagga</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Entity-based cross-document coreferencing using the vector space model</article-title>
          .
          <source>In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (ACL-COLING '98)</source>
          . pp.
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          (
          <year>1998</year>
          ), http://dx.doi.org/10.3115/980845.980859
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Biber</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Variation Across Speech and Writing</article-title>
          . Cambridge University Press (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Brants</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <source>Web 1T 5-gram Corpus Version</source>
          <volume>1</volume>
          .1.
          <string-name>
            <given-names>Google</given-names>
            <surname>Inc</surname>
          </string-name>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammond</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
          <article-title>Unsupervised stylistic segementation of poetry with change curves and extrinsic features</article-title>
          . In: Workshop on Computational Linguistics for
          <article-title>Literature at the 2012 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL '</article-title>
          <year>12</year>
          ) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
          <article-title>Native language detection with 'cheap' learner corpora</article-title>
          .
          <source>In: Proceedings of the 2011 Conference on Learner Corpus Research (LCR2011)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Automatic acquisition of lexical formality</article-title>
          .
          <source>In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10): Posters</source>
          . pp.
          <fpage>90</fpage>
          -
          <lpage>98</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Brooke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
          </string-name>
          , G.:
          <article-title>Inducing lexicons of formality from corpora</article-title>
          .
          <source>In: Proceedings of the Language Resources and Evaluation Conference (LREC '10)</source>
          , Workshop on Methods for
          <article-title>the automatic acquisition of Language Resources and their evaluation methods</article-title>
          . pp.
          <fpage>17</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Burton</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Java</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The ICWSM 2009 Spinn3r Dataset</article-title>
          .
          <source>In: Proceedings of the Third Annual Conference on Weblogs and Social Media (ICWSM</source>
          <year>2009</year>
          ). San Jose, CA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dale</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chall</surname>
          </string-name>
          , J.: Readability Revisited:
          <article-title>The New Dale-Chall Readability Formula</article-title>
          .
          <source>Brookline Books</source>
          , Cambridge, MA (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Emigh</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herring</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          :
          <article-title>Collaborative authoring on the web: A genre analysis of online encyclopedias</article-title>
          .
          <source>In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS '05)</source>
          . Washington, DC (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marthi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Segmenting documents by stylistic character</article-title>
          .
          <source>Natural Language Engineering</source>
          <volume>11</volume>
          (
          <issue>4</issue>
          ),
          <fpage>397</fpage>
          -
          <lpage>415</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Guthrie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Unsupervised Detection of Anomalous Text</article-title>
          .
          <source>Ph.D. thesis</source>
          , University of Sheffield (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Heylighen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dewaele</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Variation in the contextuality of language: An empirical measure</article-title>
          .
          <source>Foundations of Science</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>293</fpage>
          -
          <lpage>340</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luyckx</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Intrinsic plagiarism detection using character trigram distance scores</article-title>
          .
          <source>In: Proceedings of the PAN 2011 Lab: Uncovering Plagiarism, Authorship, and Social Software Misuse</source>
          (
          <year>2011</year>
          ), http://www.webis.de/research/events/pan-11
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akiva</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dershowitz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dershowitz</surname>
          </string-name>
          , N.:
          <article-title>Unsupervised decomposition of a document into authorial components</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL '11)</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zigdon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Determining an author's native language by mining a text for errors</article-title>
          .
          <source>In: Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD '05)</source>
          . pp.
          <fpage>624</fpage>
          -
          <lpage>628</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of the acquisition, induction, and representation of knowledge</article-title>
          .
          <source>Psychological Review</source>
          <volume>104</volume>
          ,
          <fpage>211</fpage>
          -
          <lpage>240</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Oberreuter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          'Huillier,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Ríos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.A.</given-names>
            ,
            <surname>Velásquez</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.D.</surname>
          </string-name>
          :
          <article-title>Approaches for intrinsic and external plagiarism detection</article-title>
          .
          <source>In: Proceedings of the PAN 2011 Lab: Uncovering Plagiarism, Authorship, and Social Software Misuse</source>
          (
          <year>2011</year>
          ), http://www.webis.de/research/events/pan-11
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Stamatatos</surname>
          </string-name>
          , E.:
          <article-title>Intrinsic plagiarism detection using character n-gram profiles</article-title>
          .
          <source>In: Proceedings of the SEPLN'09 Workshop on Uncovering Plagiarism</source>
          ,
          <article-title>Authorship and</article-title>
          ,
          <source>Social Software Misuse (PAN-09)</source>
          . pp.
          <fpage>38</fpage>
          -
          <lpage>46</lpage>
          . CEUR Workshop Proceedings, volume
          <volume>502</volume>
          (
          <year>2009</year>
          ), http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>502</volume>
          /
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>