<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EPSMS and the Document Occurrence Representation for Authorship Identification</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Graduate Program in Systems Engineering, Universidad Autónoma de Nuevo León</institution>
          ,
          <addr-line>San Nicolás de los Garza, NL 66450</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>This paper describes the participation of the PISIS team in the authorship identification track of PAN'11. We adopted two different strategies for the tasks of authorship attribution and authorship verification. For authorship attribution we performed experiments with a document occurrence representation using a standard classification-based approach. Results obtained with this approach were mixed: on the small data sets distributional representations proved very helpful, whereas on the large data sets a simple bag-of-words approach outperformed the document occurrence approach. For authorship verification we adopted a classification-based approach and proposed a modification to Ensemble Particle Swarm Model Selection (EPSMS) for selecting classification models for each task. This approach obtained acceptable performance in two out of the three data sets.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Authorship attribution (AA) and authorship verification (AV) are two closely related
problems that aim at uncovering the writing style of authors [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. Applications of AA
and AV include spam filtering [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], fraud detection [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], computer forensics [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], cyber
bullying [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and plagiarism detection [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Because of their wide applicability, mainly
in security-related settings, the development of automated AA techniques has received much
attention recently [
        <xref ref-type="bibr" rid="ref16 ref25 ref27">16,25,27</xref>
        ].
      </p>
      <p>
        AA is defined as the task of identifying who, from a set of candidates, is the author
of a given document [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], whereas AV is the task of deciding whether given text
documents were or were not written by a certain author [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Effective methods have been
proposed for both tasks; see, for example, the methods evaluated and/or reviewed
in [
        <xref ref-type="bibr" rid="ref15 ref16 ref25 ref27 ref28">28,15,16,25,27</xref>
        ]. One of the most popular formulations for AA and AV is the one based
on supervised machine learning methods, where both problems are treated as
classification tasks. More specifically, AA can be treated as a multiclass classification problem, with
as many labels as candidate authors [
        <xref ref-type="bibr" rid="ref10 ref13">10,13</xref>
        ]. AV, on the other hand, can be treated as a
binary classification problem [
        <xref ref-type="bibr" rid="ref14 ref9">9,14</xref>
        ].
        <fn><p>The author thanks the organizers of PAN’11 for providing a valuable forum
for the evaluation of authorship identification methods. The author also thanks the reviewers for their helpful
comments.</p></fn>
      </p>
      <p>
        This paper describes the approach adopted by the PISIS<sup>1</sup> team for the Authorship
Identification track of the PAN<sup>2</sup> Lab at CLEF 2011; see [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] for more information on
the PAN competition and workshop series. We adopted classification-based methods
for both the AA and AV tasks. For AA we used standard classification algorithms
with a distributional term representation for documents. Intuitively, we want to model
the writing style of authors in terms of their association with other documents, as
modeled with the document occurrence representation. Experimental results in the PAN’11
Authorship Attribution track show that the proposed approach proved very helpful for the
small data set. Results on the large data set are very competitive as well, although we
found that a simple bag-of-words representation and a nonlinear classifier outperformed
the distributional representations.
      </p>
      <p>
        For AV we used a method called Ensemble Particle Swarm Model Selection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] for
building ad hoc classifiers for each AV task. We used sample documents by the author
as positive examples and documents written by other authors as negative examples. In
order to obtain stable predictions we adopted a meta-ensemble approach that combines
the outputs of several runs of the model selection technique. Documents are ranked by
the probability that they were written by the author of interest. Experimental results in the
PAN’11 Authorship Identification track show that the proposed approach was very
effective for 2 out of 3 data sets, although several aspects of the proposed
methodology can still be improved.
      </p>
      <p>The rest of this working note describes in detail the methodologies adopted for the
AA and AV tasks, reports the results obtained with them and summarizes our main
findings. Before describing the proposed methodology we briefly review related work
on AA and AV in the next section.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        In the classification-based approach to AA and AV, sample documents written by each
author are considered instances of a standard classification problem [
        <xref ref-type="bibr" rid="ref16 ref27">16,27</xref>
        ]. Learning
algorithms that have been used for AA and AV include support vector machines (SVMs) [
        <xref ref-type="bibr" rid="ref10 ref13 ref17">10,17,13</xref>
        ]
and variants thereon [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], neural networks [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], Bayesian classifiers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], decision tree
methods [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and similarity-based techniques [
        <xref ref-type="bibr" rid="ref16 ref18">18,16</xref>
        ] among several others.
      </p>
      <p>
        In the above works the same learning algorithm has been used for building the
classification models of all the authors under consideration. An exception is the work by
Escalante et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where particle swarm model selection (PSMS) was used for building
specific classification models for each author. The hypothesis of that work is that
considering specific methods for preprocessing, feature selection and classification for
each author increases classification performance. Satisfactory performance was
obtained in the task of AV (i.e., binary classification), although AA performance (i.e.,
multiclass classification) was limited because of the incompatibility of scales among the
outputs of different models (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
      <p><sup>1</sup> http://pisis.fime.uanl.mx/ <sup>2</sup> http://www.webis.de/research/workshopseries/pan-11/</p>
      <sec id="sec-2-1">
        <p>
          Since PSMS has proved to be very effective for diverse binary classification tasks [
          <xref ref-type="bibr" rid="ref5 ref6 ref9">5,6,9</xref>
          ], in this paper we adopt a modified PSMS for
the AV task and use standard learning algorithms for the AA task in the PAN’11
Authorship Identification track.
        </p>
        <p>
          While standard learning algorithms have been used for AA and AV, a wide
variety of features has been used for representing documents, including character, lexical,
syntactic, grammatical and semantic features, among others [
          <xref ref-type="bibr" rid="ref12 ref16 ref27">12,16,27</xref>
          ]. Nevertheless, the most
used representation is still the one based on the bag-of-words formulation. In particular,
the bag-of-words formulation using character n-grams as terms has been successfully
used by several researchers [
          <xref ref-type="bibr" rid="ref10 ref13 ref22 ref9">9,10,13,22</xref>
          ]. In this paper we adopted an extended
bag-of-words representation for documents called the document occurrence representation
(DOR) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Under DOR, documents are represented by a distribution of occurrences
over other documents in the corpus, in such a way that documents are represented by
their context. DOR has been successfully used in term clustering [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], word sense
disambiguation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and multimedia image retrieval [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Authorship verification</title>
      <p>Three AV tasks were evaluated in the PAN’11 authorship identification track. For each
task the organizers provided sample documents written by the author (training set) and
documents written by the author and by other authors (validation set). The developed method
was tested on documents from the test set (labels for the test set were not
available to participants during the competition). Since both training and validation data were
available during development, we merged the documents in the training and validation
sets for training our method. Table 1 shows the number of documents written and not
written by the author of each data set in the training, validation and test sets.</p>
      <p>In our approach to AV we used the documents in the training and validation sets as training
data for a classifier that discriminates between documents written by the author
and documents written by any other author. Documents were represented by their
bag-of-words using character n-grams as terms, with n = 3. Spaces and punctuation marks
were considered characters. We did not use the distributional term representation for
this task because of the small number of documents in the training and validation sets
(see Section 4).</p>
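      <p>As an illustration of the document representation described above, the following minimal Python sketch extracts overlapping character 3-grams (spaces and punctuation included) and counts them. The function names <monospace>char_ngrams</monospace> and <monospace>bag_of_ngrams</monospace> are hypothetical, and the use of raw counts as term weights is an assumption; it is a sketch of the idea, not the code used in the competition.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of text.

    Spaces and punctuation marks are kept as ordinary characters.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def bag_of_ngrams(text, n=3):
    """Bag-of-words with character n-grams as terms (raw counts, an assumption)."""
    return Counter(char_ngrams(text, n))


# Toy usage: the 3-grams "the" and "he " each occur twice in this string.
bag = bag_of_ngrams("the cat, the hat")
```
]]></preformat>
      <p>Each document then becomes a sparse vector indexed by the 3-grams observed in the training and validation documents.</p>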
      <sec id="sec-3-1">
        <title>Classification approach</title>
        <p>
          Once documents were represented by their bag-of-words, we used Ensemble
Particle Swarm Model Selection (EPSMS) for the selection of classification models for
each data set. EPSMS is a method for the automatic selection of binary classification
models [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In a nutshell, EPSMS searches for the best ensemble that can be
generated by using the methods available in a machine learning toolbox<sup>3</sup>. An ensemble
is a classification model that combines the outputs of several classifiers. Under
certain conditions, it has been shown that ensembles can achieve better performance than
individual models [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In previous work we have shown that EPSMS can select very
effective ensemble classification models [
          <xref ref-type="bibr" rid="ref4 ref6">6,4</xref>
          ]. A distinctive feature of EPSMS is that
the members of the ensemble differ from each other in terms of preprocessing
method, feature selection technique and learning algorithm. The heterogeneity of the
considered models (diversity) together with the competitive accuracy (performance) of
the individual models favors the selection of very effective classification models. See [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">6,4,5</xref>
          ] for further
details on EPSMS.
        </p>
        <p>
          For each AV data set we provided the training+validation data as input to EPSMS, and
EPSMS returned an ensemble classifier. Although EPSMS provides very stable
classification models [
          <xref ref-type="bibr" rid="ref4 ref6">6,4</xref>
          ], in this work we wanted to obtain even more stable models.
Therefore, we adopted a meta-ensemble approach in which the outputs of several ensembles
(each one selected with EPSMS) were combined. The intuition behind this technique is
that by running EPSMS several times and combining the outputs of the corresponding
models we could obtain more stable predictions. Stability is very important in EPSMS because this
method is based on a heuristic search and the search space contains many
local minima.
        </p>
        <p>The meta-ensemble approach is as follows. For each AV data set we ran EPSMS 5
times. Then the selected ensembles were applied to the test data set. As a result we have,
for each test document, the five outputs provided by the 5 ensembles. The output of each
ensemble is a real number in [0, 1] expressing the confidence that the sample
belongs to the positive class. The outputs of each ensemble are sorted in descending
order, in such a way that the test documents that are more likely to belong to the positive
class (i.e., documents written by the author) are ranked in the first positions. For each
ranking we keep the top-10 ranked documents. Then, for each document in the union
of the five top-10 lists (at most 50 documents) we count the number of rankings in which it appeared within
the top-10 positions (a number between 1 and 5). Finally, we sort the test documents by
this number and assign the positive label to the top 10 ranked documents.</p>
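        <p>The voting procedure above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in the competition: the function name <monospace>meta_ensemble_labels</monospace> and the input format are hypothetical, and the rule for breaking ties between documents with the same number of votes (mean confidence, then document identifier) is an assumption, since the procedure described above does not specify one.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def meta_ensemble_labels(scores_per_run, top_k=10):
    """Combine the outputs of several ensembles by top-k voting.

    scores_per_run: list of dicts (one per EPSMS run) mapping a test
    document id to a confidence in [0, 1]; every run scores the same documents.
    Returns (positives, votes): the set of documents labeled positive and
    the number of runs in which each document reached the top k.
    """
    votes = Counter()
    for scores in scores_per_run:
        # Rank documents by descending confidence and keep the top k.
        ranked = sorted(scores, key=scores.get, reverse=True)
        votes.update(ranked[:top_k])

    def mean_score(doc):
        return sum(scores[doc] for scores in scores_per_run) / len(scores_per_run)

    # Sort by number of votes. Ties are broken by mean confidence and then by
    # document id (this tie-breaking rule is an assumption).
    final = sorted(votes, key=lambda d: (-votes[d], -mean_score(d), d))
    return set(final[:top_k]), votes


# Toy example: 3 runs, 4 test documents and top_k=2
# (the setting described in the text uses 5 runs and the top 10).
runs = [
    {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2},
    {"a": 0.7, "b": 0.3, "c": 0.6, "d": 0.1},
    {"a": 0.8, "b": 0.9, "c": 0.2, "d": 0.3},
]
positives, votes = meta_ensemble_labels(runs, top_k=2)
```
]]></preformat>
        <p>In the toy example, document "a" is in the top 2 of all three runs and "b" in two of them, so both receive the positive label, while "c" appears only once and "d" never.</p>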
        <p>Our hypothesis with this method is that if a document is likely to have been written by
the author, it is very likely that the document will receive a high score from several
EPSMS ensembles. Documents not written by the author of interest may appear among the
top-ranked documents for one or two ensembles, although it is reasonable to assume
that the top-ranked documents are those most likely to have been written by the author.</p>
        <sec id="sec-3-1-1">
          <p>The choice of the top 10 ranked documents was made by analyzing the outputs of the
different ensembles. We found that beyond the first 10 documents most of the documents received
very similar scores in the test set.</p>
          <p><sup>3</sup> http://clopinet.com/CLOP</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Authorship attribution</title>
      <p>
        As mentioned in the related work section, the performance of PSMS [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and EPSMS [
        <xref ref-type="bibr" rid="ref4 ref8">8,4</xref>
        ]
for multiclass classification is not as good as for binary classification tasks.
Therefore, we decided to adopt a different approach for the AA task. In particular, we
focused on the evaluation of an extended bag-of-words representation for documents
and used a standard classification model. Table 2 summarizes the main statistics of the
AA data sets for the PAN’11 Authorship Identification track.
      </p>
      <p>
The bag-of-words representation using character n-grams as terms is among the most
used representations for documents in AA [
        <xref ref-type="bibr" rid="ref10 ref13 ref16 ref22 ref27 ref9">9,10,13,16,22,27</xref>
        ]. Although
acceptable performance has been obtained with this representation in AA, we think that
the results can be improved by adopting extended
representations. Several extensions to the bag-of-words approach have been proposed in
closely related fields such as information retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], computational linguistics [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and
machine learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In this work we explore the suitability of the document
occurrence representation (DOR) for document representation in AA.
      </p>
      <p>
        DOR is a distributional term representation in which a document is represented by a
distribution of occurrences over other documents in the same corpus [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Intuitively, a
document is represented by its context. The process for obtaining the DOR
representation of documents is as follows. First, each term in the vocabulary is represented as
a distribution of occurrences over documents. Next, each document is represented
by a combination of the representations of the terms that occur in it.
      </p>
      <p>
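        DOR is considered the dual of the tf-idf representation for representing documents:
just as documents can be represented by a distribution over terms, terms can be
represented by a distribution over documents. Each term <inline-formula><tex-math>t_j</tex-math></inline-formula> in the vocabulary <inline-formula><tex-math>V</tex-math></inline-formula> is represented
by a vector of weights <inline-formula><tex-math>\mathbf{w}^{dor}_{j} = \langle w^{dor}_{j,1}, \ldots, w^{dor}_{j,N} \rangle</tex-math></inline-formula>, where <inline-formula><tex-math>N</tex-math></inline-formula> is the number of
documents in the collection and <inline-formula><tex-math>0 \leq w^{dor}_{j,k} \leq 1</tex-math></inline-formula> represents the contribution of document <inline-formula><tex-math>d_k</tex-math></inline-formula> to
the representation of <inline-formula><tex-math>t_j</tex-math></inline-formula>. Specifically, we consider the following weighting scheme [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]:
        <disp-formula><label>(1)</label><tex-math>w^{dor}(t_j, d_k) = df(t_j, d_k) \cdot \log \frac{|V|}{N_k}</tex-math></disp-formula>
where <inline-formula><tex-math>N_k</tex-math></inline-formula> is the number of different terms that appear in document <inline-formula><tex-math>d_k</tex-math></inline-formula> and <inline-formula><tex-math>df(t_j, d_k)</tex-math></inline-formula> is
given by:
        <disp-formula><label>(2)</label><tex-math><![CDATA[df(t_j, d_k) = \begin{cases} 1 + \log \#(t_j, d_k) & \text{if } \#(t_j, d_k) > 0 \\ 0 & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
where <inline-formula><tex-math>\#(t_j, d_k)</tex-math></inline-formula> denotes the number of times term <inline-formula><tex-math>t_j</tex-math></inline-formula> occurs in document <inline-formula><tex-math>d_k</tex-math></inline-formula>. The
weights are normalized using cosine normalization. Intuitively, the more frequently
term <inline-formula><tex-math>t_j</tex-math></inline-formula> occurs in document <inline-formula><tex-math>d_k</tex-math></inline-formula>, the more important <inline-formula><tex-math>d_k</tex-math></inline-formula> is for characterizing the semantics
of <inline-formula><tex-math>t_j</tex-math></inline-formula>; on the other hand, the more different terms occur in <inline-formula><tex-math>d_k</tex-math></inline-formula>, the less it contributes to
characterizing the semantics of <inline-formula><tex-math>t_j</tex-math></inline-formula>.</p>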
      <p>Once each term is represented according to Formula (1), each document is
represented by the unweighted sum of the representations of the terms that appear in the
document. In this way, a document is represented as a distribution of occurrences over other
documents in the collection. Our hypothesis on the use of DOR for AA is that the
expanded representations are more descriptive than the usual bag-of-words approach. We
did not use this representation for the AV task because the number of documents in
the different tasks is very small (see Table 1), which would result in very low-dimensional
representations of documents.</p>
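      <p>The construction of DOR representations can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in our experiments: the function names <monospace>dor_term_vectors</monospace> and <monospace>dor_document_vector</monospace> are hypothetical, natural logarithms are assumed, the sum for a document is taken over its distinct terms, and terms that were not seen when building the term vectors are ignored.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def dor_term_vectors(docs):
    """Compute cosine-normalized DOR vectors for every term.

    docs: list of documents, each a list of terms (e.g. character n-grams).
    Returns a dict mapping each term t_j to a list of N weights, one per
    document d_k, following Formulas (1) and (2).
    """
    counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = {}
    for term in vocab:
        weights = []
        for count in counts:
            occurrences = count[term]  # #(t_j, d_k)
            if occurrences > 0:
                df = 1.0 + math.log(occurrences)  # Formula (2)
                n_k = len(count)  # number of different terms in d_k
                weights.append(df * math.log(len(vocab) / n_k))  # Formula (1)
            else:
                weights.append(0.0)
        # Cosine normalization of the term vector.
        norm = math.sqrt(sum(w * w for w in weights))
        vectors[term] = [w / norm for w in weights] if norm > 0 else weights
    return vectors


def dor_document_vector(doc, vectors, n_docs):
    """Unweighted sum of the DOR vectors of the distinct terms in doc.

    Summing over distinct terms and skipping unseen terms are assumptions.
    """
    rep = [0.0] * n_docs
    for term in set(doc):
        for k, w in enumerate(vectors.get(term, [])):
            rep[k] += w
    return rep


# Toy collection with three documents over the vocabulary {a, b, c}.
docs = [["a", "b", "a"], ["b", "c"], ["c"]]
vectors = dor_term_vectors(docs)
doc0 = dor_document_vector(docs[0], vectors, len(docs))
```
]]></preformat>
      <p>Note that the dimensionality of the resulting document vectors equals the number of documents in the collection, which is why small collections yield very low-dimensional representations.</p>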
      <sec id="sec-4-1">
        <title>Classification approach</title>
        <p>
          For classification we used the neural network classifier implemented in the CLOP
toolbox [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. We selected this classifier after performing a preliminary evaluation of several
classification algorithms. We found that the combination of the DOR representation and the
neural network classifier achieved the highest performance on the validation data sets.
For the standard data sets (see Table 2) we used a straightforward multiclass classifier
(one-vs-all approach), where each class corresponds to an author. For the plus data sets (i.e.,
data sets that contain documents not written by any author in the training set) we used
a multiclass classifier with an extra class, unknown author: documents
not written by any author in the training set were simply treated as belonging to another author. Recall that we used
training+validation data for training the classifiers.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>This section reports the results obtained with the proposed methods in the authorship
identification track of PAN’11. We first analyze the performance of the AV methods and
then that of the AA techniques.</p>
      <p>Table 3 shows the results obtained in the AV data sets. The results are mixed: our
EPSMS approach obtained the first position in the second data set, although it was
ranked ninth in the third data set. For the Verify-1 data set, a single document written
by the author (out of three available) was identified; this document was ranked second
according to the weights generated with the meta-ensemble approach. The other two
relevant documents did not appear among the top-ranked documents for any of the 5
ensembles. For the Verify-2 data set 4 out of 5 documents were identified by the EPSMS
approach, while no document was correctly identified for the Verify-3 data set.</p>
      <p>From Table 1 we can see that the problems are imbalanced; moreover, the fact that the
negative examples (documents written by other authors) come from
different authors further complicates the classification problem. Nevertheless, the results
obtained with the proposed formulation are interesting and give evidence that the
classification approach to AV can be very effective. We believe the proposed method has
potential for this and other binary classification tasks, although we would like to
conduct an extensive evaluation of the proposed approach in order to determine which factors
influence its performance. A limitation of the proposed
approach is that it only ranks documents by how likely they are to have been written by the author, so
a threshold (the top 10 ranked documents) must then be used for determining which
documents were written by the author. In future work we would like to study alternative
formulations for the combination of the outputs from different ensembles.</p>
      <sec id="sec-5-1">
        <title>Authorship attribution</title>
        <p>Table 4 shows the official results obtained by our methods in the AA task. Overall, the
results were very competitive: our entries were above the average
performance of the other participants. The results were particularly positive on the Small data
sets, where our method ranked second and third. Interestingly, the DOR representation
proved more helpful for the data sets that included authors not seen in the training set,
which gives evidence that a classification approach for modeling unknown authors can be an
effective solution for this AA scenario. The performance in terms of macro and micro
average measures was proportional.</p>
        <p>In order to evaluate the advantage of the DOR representation over a standard
bag-of-words formulation we performed post-competition experiments (participants were
provided with the labels for test set documents after the competition finished). We performed
experiments using the same classification-based approach described in Section 4, but
using a binary bag-of-words representation with character n-grams as terms. We used the same
neural network with the same (default) parameters as used with the DOR representation.
Table 5 shows the performance obtained by the classifier with both representations.</p>
        <p>Results are mixed: for the Small data sets the DOR representation outperformed
the bag-of-words formulation. While the improvement in terms of
accuracy is considerable, the ranking of both methods was not significantly affected. In
contrast, on the Large data sets the bag-of-words approach outperformed the DOR
representation. The differences in all of the measures are considerable (more than 10%
in accuracy). Note that the ranking positions for the Large data sets are considerably reduced for
the bag-of-words approach. This result was somewhat unexpected, as one may think
that, since in the Large data sets we have more documents available, the DOR
representations would be more informative (richer). We think that these results are due to the fact that,
with more classes, there can be an overlap in the representations of documents that
belong to different classes. We will try to clarify this behavior in future work. Another
issue could be that the number of documents over which the DOR representation is computed
(and even the selection of which documents are used) can have an important impact
on the performance of methods based on this representation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>We described the methods adopted for the PAN’11 Authorship Identification track.
Different methods were proposed for the attribution (AA) and verification (AV) tasks. For
AV we used EPSMS, a tool for the automated selection of ensemble classifiers. Our
results show that EPSMS is a very competitive method, although it can still be further
improved. In particular, we would like to study different ways to determine whether a document
has or has not been written by an author from the outputs of several ensembles selected
with EPSMS. For AA we adopted the document occurrence representation and used a
standard classifier. We found that on the Small data sets the DOR representation proved
very helpful, although this was not the case for the Large data sets. It is interesting, and
somewhat disappointing, that a simple bag-of-words representation outperformed the
DOR-based approach on the Large data sets. We would like to analyze in more detail
the benefits of DOR for AA and which factors affect the performance of methods based
on that representation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Carrillo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eliasmith</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Lopez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Combining text vector representations for information retrieval</article-title>
          .
          <source>In: Proc. of the 12th International Conference on Text, Speech and Dialogue (TSD)</source>
          .
          <source>LNCS</source>
          , vol.
          <volume>5729</volume>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>31</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Coyotl-Morales</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor-Pineda</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-y-Gómez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution using word sequences</article-title>
          .
          <source>In: Proc. of 11th Iberoamerican Congress on Pattern Recognition. LNCS</source>
          , vol.
          <volume>4225</volume>
          , pp.
          <fpage>844</fpage>
          -
          <lpage>852</lpage>
          . Springer, Cancun, Mexico (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dietterich</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Ensemble methods in machine learning</article-title>
          .
          <source>In: Proc. of the First workshop on Multiple Classifier Systems. LNCS</source>
          , vol.
          <volume>1857</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . Springer (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Altamirano</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reta</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosales</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Acute leukemia classification with ensemble particle swarm model selection</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>, submitted
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sucar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Particle swarm model selection</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>10</volume>
          ,
          <fpage>405</fpage>
          -
          <lpage>440</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sucar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Ensemble particle swarm model selection</article-title>
          .
          <source>In: Proc. of the World Congress on Computational Intelligence</source>
          . pp.
          <fpage>1814</fpage>
          -
          <lpage>1821</lpage>
          . IEEE, Barcelona, Spain (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sucar</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Multimodal indexing based on semantic cohesion for image retrieval</article-title>
          .
          <source>Information Retrieval</source>
          , in press (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sucar</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          :
          <article-title>An energy-based model for image annotation and retrieval</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>115</volume>
          (
          <issue>6</issue>
          ),
          <fpage>787</fpage>
          -
          <lpage>803</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Particle swarm model selection for authorship verification</article-title>
          .
          <source>In: Proc. of the 14th Iberoamerican Congress on Pattern Recognition. LNCS</source>
          , vol.
          <volume>5856</volume>
          , pp.
          <fpage>563</fpage>
          -
          <lpage>570</lpage>
          . Springer, Guadalajara, Mexico (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Local histograms of character n-grams for authorship attribution</article-title>
          .
          <source>In: Proc. of the 49th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>288</fpage>
          -
          <lpage>298</lpage>
          . ACL (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gale</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Church</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yarowsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A method for disambiguating word senses in a large corpus</article-title>
          .
          <source>Computers and the Humanities</source>
          <volume>26</volume>
          (
          <issue>5</issue>
          ),
          <fpage>415</fpage>
          -
          <lpage>439</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Grieve</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ),
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Houvardas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>N-gram feature selection for author identification</article-title>
          .
          <source>In: Proc. of the 12th International Conference on Artificial Intelligence: Methodology, Systems, and Applications. LNCS</source>
          , vol.
          <volume>4183</volume>
          , pp.
          <fpage>77</fpage>
          -
          <lpage>86</lpage>
          . Springer, Varna, Bulgaria (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fung</surname>
            ,
            <given-names>B.C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debbabi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>E-mail authorship verification for forensic investigation</article-title>
          .
          <source>In: Proc. of the 2010 ACM Symposium on Applied Computing</source>
          . pp.
          <fpage>1591</fpage>
          -
          <lpage>1598</lpage>
          . SAC '10, ACM, New York, NY, USA (
          <year>2010</year>
          ), http://doi.acm.org/10.1145/1774088.1774428
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational methods in authorship attribution</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          ,
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Authorship verification as a one-class classification problem</article-title>
          .
          <source>In: Proc. of the twenty-first international conference on Machine learning</source>
          . p.
          <fpage>62</fpage>
          . ICML '04, ACM, New York, NY, USA (
          <year>2004</year>
          ), http://doi.acm.org/10.1145/1015330.1015448
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lambers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veenman</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          :
          <article-title>Forensic authorship attribution using compression distances to prototypes</article-title>
          .
          <source>In: Computational Forensics. LNCS</source>
          , vol.
          <volume>5718</volume>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>24</lpage>
          . Springer, Berlin, Heidelberg, ISBN 978-3-642-03520-3 (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lavelli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Distributional term representations: An experimental comparison</article-title>
          .
          <source>In: Proc. of the International Conference of Information and Knowledge Management</source>
          . pp.
          <fpage>615</fpage>
          -
          <lpage>624</lpage>
          . ACM Press (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lebanon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dillon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The locally weighted bag of words framework for document representation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>8</volume>
          ,
          <fpage>2405</fpage>
          -
          <lpage>2441</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          :
          <article-title>Term clustering of syntactic phrases</article-title>
          .
          <source>In: Proc. of the 13th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <fpage>385</fpage>
          -
          <lpage>404</lpage>
          . ACM Press, Bruxelles, Belgium (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Luyckx</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution and verification with many authors and limited data</article-title>
          .
          <source>In: Proc. of the 22nd International Conference on Computational Linguistics</source>
          . vol.
          <volume>1</volume>
          , pp.
          <fpage>513</fpage>
          -
          <lpage>520</lpage>
          . ACM Press, Manchester, UK (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Pillay</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solorio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution of web forum posts</article-title>
          .
          <source>In: Proc. of the eCrime Researchers Summit (eCrime) 2010</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . IEEE, Dallas, TX, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Plakias</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Author identification using a tensor space representation</article-title>
          .
          <source>In: Proc. of the 18th European Conference on Artificial Intelligence</source>
          . vol.
          <volume>178</volume>
          , pp.
          <fpage>833</fpage>
          -
          <lpage>834</lpage>
          . IOS Press, Patras, Greece (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In: Proc. of the 23rd International Conference on Computational Linguistics (COLING 2010)</source>
          . pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          . ACL (August
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Saffari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guyon</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Quickstart guide for CLOP</article-title>
          .
          <source>Tech. rep.</source>
          , Graz University of Technology and Clopinet (May
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A survey of modern authorship attribution methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <source>Proc. of the 3rd international workshop on uncovering plagiarism, authorship, and social software misuse, PAN'09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Tearle</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demuth</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An algorithm for automated authorship attribution using neural networks</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>23</volume>
          (
          <issue>4</issue>
          ),
          <fpage>425</fpage>
          -
          <lpage>442</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>de Vel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corney</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohay</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Multitopic email authorship attribution forensics</article-title>
          .
          <source>In: Proc. of the ACM Conference on Computer Security - Workshop on Data Mining for Security Applications</source>
          . Philadelphia, PA, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>