<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bootstrapped Authorship Attribution in Compression Space</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ramon de Graaff</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cor J. Veenman</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Technology and Biometrics Department, Netherlands Forensics Institute</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leiden University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>From a machine learning standpoint, the PAN 2012 Lab contest posed one major challenge: in all authorship attribution tasks, the number of training documents was extremely low. We extended our previous work, in which compression distances to randomly selected prototype documents from the training corpus were used as feature representation. A supervised multi-class classifier was learned in the resulting feature space using the remaining documents. Inspired by the bootstrapped resampling method, we now drew document samples from the few source documents in order to obtain sufficient prototypes and samples to learn a supervised classifier. Using internal validation, we tuned the size of the document samples, compression method, distance measure, classification method, and decision threshold (open-class tasks) for optimal F1 score. With this scheme we submitted runs for the closed-class and open-class author identification tasks. In the overall results for these tasks we achieved a shared fourth ranking, based on the reported average recall of the 11 teams.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This year's PAN 2012 Lab author identification task had two sub-tasks: traditional Authorship Attribution and Sexual Predator Identification. Within the Authorship Attribution sub-task, we were interested in closed-class and open-class (traditional) authorship attribution. The second problem within this sub-task, authorship clustering or intrinsic plagiarism detection, was not considered.</p>
      <p>The provided datasets had a very low number of candidate authors compared to last year's contest. Moreover, the number of sample documents per author was very low: only two documents per author. Third, the sample documents were relatively large, in the order of 10 kB to 100 kB. From a machine learning standpoint, the first point makes life easier, while the second point is a major challenge. With two samples per class, the sample documents that make up the training set hardly allow a model to generalize. The situation is even worse: to be able to do internal validation for model selection, one document has to be kept apart, so that only one document remains for training the recognition models.</p>
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] we proposed the Compression Distance to Prototypes (CDP) method. We applied this method to datasets with characteristics similar to those of the PAN 2011 Lab authorship attribution contest: a high number of authors, tens of sample documents per author, and relatively small training documents. In short, the CDP method randomly selects, per author, a part of the provided document corpus as prototypes. The remaining documents are used as training set for recognition model learning. The feature representation of the training set is computed as compression distances to the prototypes. Compression distances are distance measures in the sense that similar documents have small distances and dissimilar documents have larger distances.
      </p>
      <p>Without adaptation, our previous work could not be applied to the PAN 2012 Lab, since there is effectively only one training document per author. One possible adaptation is to compute the compression distance from a test document to all training documents and attribute the test document to the author of the closest training document. This 1-nearest-neighbor procedure is known to be sensitive to noise or, in other words, it easily overfits to the training corpus. Besides the 1-nearest-neighbor rule, hardly any methods can be applied when effectively only one document is available for model learning, and the same risk of overfitting would apply to them.</p>
      <p>Here, we propose an adaptation of our previous work in which we regenerate a
training set from the given corpus such that statistical model learning becomes feasible.
In the next section, we first pose the problems derived from the PAN 2012 Lab sub-task we take part in. Then, we elaborate on our method and the proposed extensions. In the following section, we apply our method to the training corpus for parameter tuning using internal cross-validation. Among the model parameters are the compression method for the compression distance computation and the compression distance measure itself. Finally, we describe the results of applying the tuned models to the test corpus as submitted for the contest. We wrap up with concluding remarks about the obtained results.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem statement</title>
      <p>The problems of the traditional authorship attribution sub-task that we considered for the PAN 2012 Lab are the closed-class and open-class authorship attribution problems. As a statistical pattern recognition problem, closed-class authorship attribution comes down to a standard multi-class classification problem, where each class is one of the known authors. Open-class authorship attribution can be seen as a multi-class problem in which one class is added to represent all unknown authors. The problem is to find proper representations and models for the closed-class and open-class authorship attribution tasks, where precision, recall and the F1 measure are used as evaluation metrics. These measures are defined as follows.</p>
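      <p>Precision P<sub>A</sub> for author A is defined as:</p>
      <disp-formula id="eq-1"><label>(1)</label><tex-math><![CDATA[P_A = \frac{\mathrm{correct}(A)}{\text{retrieved-documents}(A)} = \frac{TP_A}{TP_A + FP_A},]]></tex-math></disp-formula>
      <p>where TP<sub>A</sub> (True Positive) is the number of documents that are correctly attributed to author A and FP<sub>A</sub> (False Positive) is the number of documents that are incorrectly attributed to author A.</p>
      <p>Recall R<sub>A</sub> for author A is defined as:</p>
      <disp-formula id="eq-2"><label>(2)</label><tex-math><![CDATA[R_A = \frac{\mathrm{correct}(A)}{\text{relevant-documents}(A)} = \frac{TP_A}{TP_A + FN_A},]]></tex-math></disp-formula>
      <p>
        where FN<sub>A</sub> (False Negative) is the number of missed attributions to author A. The F1 measure [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is defined as the harmonic mean of recall and precision:
      </p>
      <disp-formula id="eq-3"><label>(3)</label><tex-math><![CDATA[F_{1,A} = \frac{2 \, P_A \, R_A}{P_A + R_A}.]]></tex-math></disp-formula>
      <p>These measures can be aggregated by averaging, either author based or document based, leading to macro and micro averages, respectively [19]. For instance, the macro averaged recall R<sub>macro</sub> and the micro averaged recall R<sub>micro</sub> are defined as:</p>
      <disp-formula id="eq-4"><label>(4)</label><tex-math><![CDATA[R_{\mathrm{macro}} = \frac{1}{n} \sum_{i=1}^{n} R_{A_i},]]></tex-math></disp-formula>
      <disp-formula id="eq-5"><label>(5)</label><tex-math><![CDATA[R_{\mathrm{micro}} = \frac{1}{k} \sum_{i=1}^{n} |D_{A_i}| \, R_{A_i},]]></tex-math></disp-formula>
      <p>where n is the number of authors, <inline-formula><tex-math><![CDATA[|D_{A_i}|]]></tex-math></inline-formula> is the number of documents in the test set for author A<sub>i</sub>, and <inline-formula><tex-math><![CDATA[k = \sum_{i=1}^{n} |D_{A_i}|]]></tex-math></inline-formula>.</p>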
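      <p>As an illustration of these measures, the following minimal Python sketch computes per-author precision, recall and F1 from lists of true and predicted author labels, together with the macro and micro averaged recall. The function names and the toy labels are hypothetical and not part of the evaluation code used in the contest.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def per_author_scores(true_labels, predicted_labels):
    """Return {author: (precision, recall, f1)} for every author in true_labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the guessed author received a document it did not write
            fn[truth] += 1  # the true author missed one of its documents
    scores = {}
    for author in sorted(set(true_labels)):
        retrieved = tp[author] + fp[author]
        relevant = tp[author] + fn[author]
        precision = tp[author] / retrieved if retrieved else 0.0
        recall = tp[author] / relevant if relevant else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[author] = (precision, recall, f1)
    return scores


def macro_recall(true_labels, predicted_labels):
    """Author-based average: every author counts equally."""
    scores = per_author_scores(true_labels, predicted_labels)
    return sum(recall for _, recall, _ in scores.values()) / len(scores)


def micro_recall(true_labels, predicted_labels):
    """Document-based average: every author is weighted by its document count."""
    scores = per_author_scores(true_labels, predicted_labels)
    sizes = Counter(true_labels)
    k = sum(sizes.values())
    return sum(sizes[a] * scores[a][1] for a in scores) / k
```
]]></preformat>
      <p>With three test documents by author A and one by author B, the two averages differ as soon as the per-author recalls differ, because the micro average gives author A three times the weight of author B.</p>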
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <p>
        The method we propose for this task is based on the Compression Distance to
Prototypes (CDP) method we reported earlier in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We first summarize the CDP method
and then extend it with provisions to deal with the extremely small sample size of the
contest, i.e., one training sample per author.
      </p>
      <p>
        The CDP method derives its name from the way the training documents are represented, i.e., its feature representation. In contrast with typical representations for text documents with lexical, syntactic and structural features, we represent a document by its similarity (or dissimilarity) to a set of other documents. Such a dissimilarity-based representation was proposed earlier in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and has proven to give competitive
classification results. It can be favorable for obtaining lower dimensional representations,
especially if suitable distance measures are available. Importantly, the distance measure
to be used should discriminate the samples such that dissimilar samples have large
distances and similar samples have small distances. Several compression based distances
have these properties and have been applied successfully in different domains [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [18]. These compression-based approaches are practical implementations
of the information distances expressed in the non-computable Kolmogorov complexity
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we applied the Compression Dissimilarity Measure (CDM) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
        <disp-formula id="eq-6"><label>(6)</label><tex-math><![CDATA[\mathrm{CDM}(x, y) = \frac{C(xy)}{C(x) + C(y)},]]></tex-math></disp-formula>
      </p>
      <p>
        where C(x) is the size of the compressed object x and xy is the concatenation of x and y. Essential in all these measures is a compressor that finds the smallest possible encoding of, in this case, the sample documents. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we used the LZ76 compression
method [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The contribution of that work was to use compression-based distances as
feature representation.
      </p>
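      <p>The CDM and the resulting feature representation can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions only: it uses bz2 from the Python standard library as a stand-in compressor, because the LZ76 compressor is not available there, and the helper names compressed_size, cdm and cdp_features are hypothetical.</p>
      <preformat><![CDATA[
```python
import bz2


def compressed_size(data: bytes) -> int:
    """C(x): size in bytes of the compressed object (bz2 is a stand-in)."""
    return len(bz2.compress(data))


def cdm(x: bytes, y: bytes) -> float:
    """Compression Dissimilarity Measure: C(xy) / (C(x) + C(y))."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def cdp_features(document: bytes, prototypes: list) -> list:
    """Represent a document by its CDM distance to each prototype."""
    return [cdm(document, prototype) for prototype in prototypes]
```
]]></preformat>
      <p>Each document thus becomes a vector with one entry per prototype, so the dimensionality of the feature space equals the number of prototypes rather than the size of a vocabulary.</p>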
      <sec id="sec-3-1">
        <title>Bootstrapped document samples</title>
        <p>
          After having defined the representation of the documents, we propose a way of regenerating sample documents from the single given training document per author. This is possible because, fortunately, the documents of the PAN 2012 Lab contest are relatively large. We use the idea underlying bootstrapping, a well-known resampling method for generalization error estimation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The rationale behind the bootstrapped resampling method is that the best representation of the data distribution is the given dataset itself. In our case, this translates to: the best representation of the writing style of an author is the one document that we have.
        </p>
        <p>The method works as follows. First, we draw a prototype of a certain length for the given author from the start of the document. The length of the prototype is a parameter to be selected. Then we proceed similarly to the bootstrapped resampling method with the remaining part of the document. That is, in order to get training sample documents written by the author, we draw from her ’model’ with replacement, where the model is the one source document. Each training sample is drawn as follows. The starting point in the (remaining) document is chosen randomly. Then the required number of characters is read. In case the sample would read past the end of the document, reading continues at the start of the document until the required length is obtained.</p>
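        <p>The sampling procedure can be sketched as follows in Python. This is a minimal illustration, not the original Matlab implementation: the function names are hypothetical, and the sketch assumes that a sample reading past the end wraps around to the start of the remaining part of the document, so that samples never overlap the prototype. The text leaves that detail open.</p>
        <preformat><![CDATA[
```python
import random


def split_prototype(document: str, prototype_size: int):
    """Take the prototype from the start; the rest is the sampling source."""
    return document[:prototype_size], document[prototype_size:]


def draw_sample(source: str, sample_size: int, rng: random.Random) -> str:
    """Read sample_size characters from a random start, wrapping at the end."""
    start = rng.randrange(len(source))
    return "".join(source[(start + i) % len(source)] for i in range(sample_size))


def bootstrap_samples(document, prototype_size, sample_size, n_samples, seed=0):
    """Return one prototype and n_samples training samples drawn with replacement."""
    rng = random.Random(seed)
    prototype, remainder = split_prototype(document, prototype_size)
    samples = [draw_sample(remainder, sample_size, rng) for _ in range(n_samples)]
    return prototype, samples
```
]]></preformat>
        <p>Because the starting points are drawn independently, samples may overlap each other, which corresponds to drawing with replacement from the author's single source document.</p>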
      </sec>
      <sec id="sec-3-2">
        <title>Classifier learning</title>
        <p><bold>Closed-class recognition.</bold> Finally, a classifier must be learned in the compression distance space. For this purpose any multi-class classifier can be used. For model selection and parameter tuning we use the F1 measure, which aggregates precision and recall, the performance measures used in the contest.</p>
        <p><bold>Open-class recognition.</bold> For the open-class tasks, we additionally had to estimate a threshold for deciding for an unknown author. That is, the classifier decides for the most probable class, unless its probability is below the given threshold. In that case, it decides for none of the known classes. We estimated the threshold by trying all candidate thresholds on the training set and selecting the one that resulted in the best averaged F1 score on the held-out documents, where each class in turn was left out and treated as none of the known classes.</p>
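        <p>The open-class decision rule and the threshold search can be sketched as follows. The sketch assumes that the classifier outputs a probability per known author. The names classify_open and best_threshold, the list of candidate thresholds and the pluggable score function (a stand-in for the averaged F1 score) are hypothetical.</p>
        <preformat><![CDATA[
```python
def classify_open(posteriors: dict, threshold: float, unknown: str = "Unknown") -> str:
    """Pick the most probable author unless its probability is below the threshold."""
    author = max(posteriors, key=posteriors.get)
    return author if posteriors[author] >= threshold else unknown


def best_threshold(validation, candidates, score):
    """Return the candidate threshold with the highest validation score.

    validation: list of (posteriors, true_label) pairs, where true_label is
                "Unknown" for documents of the left-out author.
    score:      function (true_labels, predicted_labels) -> float.
    """
    truths = [truth for _, truth in validation]

    def quality(threshold):
        predicted = [classify_open(p, threshold) for p, _ in validation]
        return score(truths, predicted)

    return max(candidates, key=quality)
```
]]></preformat>
        <p>A low threshold never rejects and therefore misses every unknown author, while a high threshold rejects everything. The search picks the value in between that scores best on the validation documents.</p>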
      </sec>
      <sec id="sec-3-3">
        <title>Method parameters</title>
        <p>
          Below, we list the parameters involved in the method. Some of these are selected a priori, some are estimated globally because they seem to be less dependent on the dataset, and others are optimized per dataset, as will be described in the Section Experiments.
        </p>
        <list list-type="order">
          <list-item>
            <p>
              Distance measure: Besides the already mentioned CDM, in this work we also consider the Normalized Compression Distance (NCD) [
              <xref ref-type="bibr" rid="ref2">2</xref>
              ] as dissimilarity measure:
              <disp-formula id="eq-7"><label>(7)</label><tex-math><![CDATA[\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}]]></tex-math></disp-formula>
            </p>
          </list-item>
          <list-item>
            <p>
              Compressor: The compressor is the core ingredient of the compression-based distance measures. In theory, the best compressor should be used. Therefore, in this work we additionally considered a variant of the PPM algorithm, PPMd, that is among the best for text compression [
              <xref ref-type="bibr" rid="ref3">3</xref>
              ], [17], [
              <xref ref-type="bibr" rid="ref13">13</xref>
              ].
            </p>
          </list-item>
          <list-item>
            <p>Bootstrapping: The bootstrapping method adds four parameters: the number of prototypes per author, the number of drawn training samples, the size of the prototypes, and the size of the training samples.</p>
          </list-item>
          <list-item>
            <p>Classification: The classification method to be applied for closed-class and open-class recognition.</p>
          </list-item>
          <list-item>
            <p>Open-class recognition: The threshold for open-class recognition.</p>
          </list-item>
        </list>
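        <p>As a hedged illustration of the first two parameters, the sketch below implements the NCD with the compressor passed in as an argument. PPMd is not part of the Python standard library, so lzma and bz2 serve as stand-ins here. The function name ncd and its compress parameter are hypothetical.</p>
        <preformat><![CDATA[
```python
import bz2
import lzma


def ncd(x: bytes, y: bytes, compress=lzma.compress) -> float:
    """Normalized Compression Distance with a pluggable compressor.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the length of the compressed byte string.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_bz2(x: bytes, y: bytes) -> float:
    """The same measure with bz2 as the compressor."""
    return ncd(x, y, compress=bz2.compress)
```
]]></preformat>
        <p>Keeping the compressor as a parameter makes it straightforward to compare compressors on the same distance measure, which is how the compressor choice enters the method.</p>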
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Below, we first describe the different datasets in the Authorship Attribution sub-task we
submitted our runs for. Then, we describe the way we handled the method parameters
and we list the internal cross-validation results and submitted runs of the experiments.
</p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>The PAN12 authorship identification sub-task had several datasets with different
numbers of authors. In every dataset for each author two documents were given.</p>
        <p>The mean document sizes and standard deviations for the datasets of PAN12 are shown in Figure 1. Details on the datasets in PAN12 can be found in Table 1.</p>
        <p>The parameters of the method were either chosen a priori, estimated globally for all tasks, tuned per task by two-fold cross-validation for optimal averaged F1 score, or explored as part of the experiments. Two-fold cross-validation was conducted by taking, for each author, the first document as training document for prototype sampling and bootstrapped sampling, and the second for validation. The validation document was divided into three parts to enable better F1 score differentiation between parameter settings. Then the sampling and validation sets were rotated by using the second document for training and the first for validation. This process was repeated 10 times and the results were averaged.</p>
        <list list-type="order">
          <list-item>
            <p>
              Distance measure: Preliminary experiments on the TASK I dataset showed that the NCD and CDM distance measures performed similarly. The reported experiments were therefore conducted with the CDM measure as before, i.e., as in [
              <xref ref-type="bibr" rid="ref9">9</xref>
              ].
            </p>
          </list-item>
          <list-item>
            <p>Compressor: Preliminary experiments on the PAN 2011 Lab data and the PAN 2012 Lab TASK I showed that the PPMd compressor clearly outperformed the previously used LZ76 compressor. The reported experiments were therefore conducted with the PPMd method.</p>
          </list-item>
          <list-item>
            <p>Bootstrapping: Because we draw the prototypes without replacement, we fixed the number of prototypes to one per author. The size of the prototypes and the number and size of the training samples are tuned through two-fold cross-validation for optimal F1 score per task.</p>
          </list-item>
          <list-item>
            <p>Classification: The classification method is explored in the experiments.</p>
          </list-item>
          <list-item>
            <p>Open-class recognition: The threshold for open-class attribution is established through two-fold cross-validation for optimal F1 score. In the averaging of F1 scores, the unknown class is considered equally important as all known authors taken together.</p>
          </list-item>
        </list>
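        <p>The two-fold rotation described above can be sketched as follows. The sketch only produces the training and validation material for the two folds. It assumes that the validation document is cut into three consecutive parts of nearly equal length, which the text does not specify, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def split_in_parts(document: str, n_parts: int = 3) -> list:
    """Divide a validation document into n_parts nearly equal consecutive parts."""
    size, rest = divmod(len(document), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < rest else 0)
        parts.append(document[start:end])
        start = end
    return parts


def two_fold_rounds(corpus: dict):
    """Yield (training, validation) for both folds.

    corpus maps each author to a pair (first_document, second_document).
    training maps author -> document used for prototype and bootstrap sampling;
    validation maps author -> list of validation parts.
    """
    for train_idx, val_idx in ((0, 1), (1, 0)):
        training = {a: docs[train_idx] for a, docs in corpus.items()}
        validation = {a: split_in_parts(docs[val_idx]) for a, docs in corpus.items()}
        yield training, validation
```
]]></preformat>
        <p>Repeating both folds several times with different random bootstrap samples and averaging the scores, as done in the experiments, reduces the variance caused by the random sampling.</p>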
        <p>
          We implemented the method in Matlab and used the pattern recognition toolbox
PRTools [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for the classification models.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>We describe the experiments for the closed-class tasks and the open-class tasks separately.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Closed-class</title>
        <p>
          The datasets used in these tasks consist of three, eight and fourteen authors for TASK A, TASK C and TASK I, respectively. The sizes of the training documents for these tasks differ considerably, as can be seen in Table 1. Figures 2, 3 and 4 show the internal cross-validation results for some optimized parameter settings, such as the prototype size and the size of the bootstrapped training sample documents. Based on these figures we set the method parameters to be used in the runs for submission. We selected the Fisher linear discriminant [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as classifier for all tasks and set the number of bootstrapped training samples to thirty. Further, we set the prototype size and the size of the bootstrapped training samples as shown in the figures for the respective tasks (Figures 2, 3 and 4) and in Table 2.
        </p>
        <p>For the submissions, we could exploit both training documents for each author, since the method parameters had already been tuned. For the first submission, we used two prototypes per author. That is, we took a prototype from both training documents of each author. From the remaining part of the training documents, we drew thirty samples of the size given by the tuned parameter in Table 2. For the second submission, we used one prototype per author from the first document and sampled both documents for training samples with the given parameters.</p>
        <p>This resulted in models based on more training data than in the internal validation. We expected that this could only improve the performance. However, the performance of SUBMISSION 2 is considerably worse than that of the internal validation and of SUBMISSION 1, whereas SUBMISSION 1 performs well. The performances of the two runs on the test documents provided by PAN12 are shown in Table 3.</p>
        <p>The training data for the open-class tasks TASK B, TASK D and TASK J is the same as for the corresponding closed-class tasks TASK A, TASK C and TASK I.</p>
        <p>The internal validation on the open datasets is done using a ten-repeat experiment on the dataset while measuring the F1 performance. Per repeat, every author is offered twice as the ’Unknown’, once for each of its training documents. We compute the F1 score in two ways, which we denote as PN and P50. With PN, we express that the unknown author is as important to recognize as any single author. With P50, we express that the unknown author is as important as all remaining authors together. Hence, the unknown author is weighted for 50% and the other authors together for the other 50%.</p>
        <p>In Table 5, we see that PN is higher than P50 on every dataset. This corresponds to our expectation, because here only 1/n-th, with n the number of authors, is offered as ’Unknown’. Distinguishing the ’Unknown’ author is here as important as attributing the test documents to every known author. We introduced P50 because we expect more ’Unknown’ authors in the test sets provided by PAN12 than only 1/n-th.</p>
        <p>
          In Table 4, the number of known versus unknown authors in the test set provided by PAN12 is shown, as well as the optimized thresholds. Table 6 shows the performances of the submissions on the test set provided by PAN12. After optimizing the thresholds, we took the same models for SUBMISSION 1 and SUBMISSION 2 as for the closed-class tasks. That is, n prototypes for SUBMISSION 1 and n/2 for SUBMISSION 2, where n is the number of training documents. As we expected an equal or larger number of documents by ’Unknown’ authors, we submitted the models with the threshold P50<sub>T</sub>. In TASK B, the number of documents by ’Unknown’ authors is only four, so the threshold PN<sub>T</sub> would have performed slightly better. Fortunately, in TASK D the number of documents by ’Unknown’ authors is nine, which is about half of the dataset, and the threshold P50<sub>T</sub> performs much better than the threshold PN<sub>T</sub>. In TASK J, both thresholds resulted in the same labeling of the test dataset. The models from SUBMISSION 2 produced the same labels for both thresholds on all tasks. Clearly, for optimal performance the proportion of known and unknown authors should be known beforehand.
        </p>
        <p>
          For the PAN 2012 Traditional Authorship Attribution tasks, we modified our previous work to deal with the major challenge of the provided datasets. That is, for all tasks the number of training documents per author was only two. To be able to do model selection, one document had to be kept apart, so that only one document could be exploited for model learning. We proposed a method for generating additional training documents inspired by the bootstrapped resampling method for generalization error estimation. Both the internal validation results and the results of the submitted runs on the test data show that this resulted in a promising method for closed-class and open-class authorship attribution. In the overall results, we achieved a shared fourth ranking for the authorship attribution tasks, based on the reported average recall of the 11 teams. Further, the CDM compression distance in combination with the PPMd compressor outperformed other combinations, which is in line with results reported in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>The open-class experiments showed how important it is for statistical pattern recognition methods that the training data and the test data have the same characteristics. In this case, the proportion of known and unknown authors could not be derived from the training data. We guessed the unknown authors to be as frequent as all known authors together. Clearly, this assumption has a strong impact on the results, as was shown in the experiments in which we assumed an unknown author to be as frequent as any single known author.</p>
        <p>Finally, the improved performance of SUBMISSION 1 over SUBMISSION 2 shows that with more prototypes a better representation of the documents and a better recognition performance can be obtained. This is an aspect that should be explored further. For instance, the number of prototypes could be optimized by exploiting the document bootstrapping for prototypes too.</p>
        <p>17. Shkarin, D.: PPM: one step to practicality. In: Proceedings of the Data Compression Conference. vol. DCC ’02, p. 202. IEEE Computer Society (2002)</p>
        <p>18. Telles, G., Minghim, R., Paulovich, F.: Normalized compression distance for visual analysis of document collections. Computers and Graphics 31(3), 327–337 (2007)</p>
        <p>19. Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval 1, 69–90 (1999)</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Benedetto</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caglioti</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loreto</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Language trees and zipping</article-title>
          .
          <source>Phys. Rev. Lett</source>
          .
          <volume>88</volume>
          (
          <issue>4</issue>
          ),
          <elocation-id>048702</elocation-id>
          (Jan
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cilibrasi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitányi</surname>
            ,
            <given-names>P.M.B.</given-names>
          </string-name>
          :
          <article-title>Clustering by compression</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>51</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1523</fpage>
          -
          <lpage>1545</lpage>
          (
          <month>Apr</month>
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cleary</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Data compression using adaptive coding and partial string matching</article-title>
          .
          <source>IEEE Transactions on Communications</source>
          <volume>32</volume>
          (
          <issue>4</issue>
          ),
          <fpage>396</fpage>
          -
          <lpage>402</lpage>
          (
          <month>Apr</month>
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Duda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stork</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <source>Pattern Classification</source>
          . John Wiley and Sons, Inc., New York (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Duin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Juszczak</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paclík</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pękalska</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Ridder</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tax</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verzakov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>PR-Tools4.1, a Matlab toolbox for pattern recognition</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Efron</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Bootstrap methods: another look at the jackknife</article-title>
          .
          <source>Annals of Statistics</source>
          <volume>7</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Keogh</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lonardi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratanamahatana</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards parameter-free data mining</article-title>
          .
          <source>In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . pp.
          <fpage>206</fpage>
          -
          <lpage>215</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kukushkina</surname>
            ,
            <given-names>O.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polikarpov</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khmelev</surname>
            ,
            <given-names>D.V.</given-names>
          </string-name>
          :
          <article-title>Using literal and grammatical statistics for authorship attribution</article-title>
          .
          <source>Problems of Information Transmission</source>
          <volume>37</volume>
          (
          <issue>2</issue>
          ),
          <fpage>172</fpage>
          -
          <lpage>184</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lambers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Forensic authorship attribution using compression distances to prototypes</article-title>
          .
          <source>In: Proceedings of the Third International Workshop on Computational Forensics</source>
          , The Hague, The Netherlands, August 13-14
          . pp.
          <fpage>13</fpage>
          -
          <lpage>24</lpage>
          . Springer-Verlag, Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lempel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ziv</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>On the complexity of finite sequences</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>22</volume>
          ,
          <fpage>75</fpage>
          -
          <lpage>81</lpage>
          (
          <year>1976</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitányi</surname>
            ,
            <given-names>P.M.B.</given-names>
          </string-name>
          :
          <article-title>The similarity metric</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>50</volume>
          (
          <issue>12</issue>
          ),
          <fpage>3250</fpage>
          -
          <lpage>3264</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vitányi</surname>
            ,
            <given-names>P.M.B.</given-names>
          </string-name>
          :
          <source>An Introduction to Kolmogorov Complexity and its Applications</source>
          . Springer-Verlag, New York (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mahoney</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Large text compression benchmark</article-title>
          , http://www.mattmahoney.net/text/text.html
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pękalska</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skurichina</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Combining Fisher linear discriminants for dissimilarity representations</article-title>
          .
          In:
          <source>Proceedings of the First International Workshop on Multiple Classifier Systems</source>
          . vol.
          <volume>1857</volume>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          . Springer-Verlag (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <source>Information Retrieval</source>
          . Butterworth
          (
          <year>1979</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sculley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brodley</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>Compression and machine learning: A new perspective on feature space vectors</article-title>
          .
          In:
          <source>Proceedings of the Data Compression Conference</source>
          . pp.
          <fpage>332</fpage>
          -
          <lpage>332</lpage>
          . DCC '06, IEEE Computer Society, Washington, DC, USA (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>