<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complexity Measures and POS N-grams for Author Identification in Several Languages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>SINAI at PAN@CLEF</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocío López-Anguita</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel C. Díaz-Galiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Estudios Avanzados en TIC Universidad de Jaén</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents our approach and results for the 2018 PAN Author Identification Task. In this task, we are given a set of documents (known fanfics) by a small number (up to 20) of candidate authors. All documents are in the same language, which may be English, French, Italian, Polish, or Spanish. The task consists of developing a system to identify the authors of another set of documents (unknown fanfics). We have used two strategies to solve this task. The first strategy uses several measures of the complexity of the fanfic texts for each candidate. In the second strategy, we analyze the fanfics of each candidate by applying a Part-Of-Speech tagger and an n-gram based vector space model.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This year’s Author Identification task [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] in PAN@CLEF [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] is divided into
cross-domain authorship attribution and style change detection. In our work, we have focused
on the first case. The goal of this task is to find the best approaches to model
an author’s writing style so that new texts can be attributed. This year the task has moved to
the classification of so-called fanfics, which are narrative texts on topics and subjects
originally created by other authors but written by fans of those authors.
For example, we can have short stories about Harry Potter written by people other
than J.K. Rowling. This new challenge emphasizes the relevance of the writing style
rather than the topic of the content. Nevertheless, content could also be important, as
the richness of vocabulary or the use of certain words and expressions could be inherent
characteristics of a certain author.
      </p>
      <p>We wanted to study the effectiveness of well-known text complexity measures as
features for this text classification task. Also, inspired by other features not derived
from content meaning or related to content topics, we have applied a POS tagger to use
n-grams of POS tag sequences as features in a vector space model. Our results show
that complexity measures are not a good source of information for this task, whereas
POS sequences behave reasonably well in certain languages.</p>
      <p>
        The article is organized as follows: Section 2 introduces the complexity measures
used for modeling texts; Section 3 describes in detail the approaches followed in this
task; Section 4 describes the experiments carried out and shows the results obtained;
Section 5 closes the article with some reflections.
      </p>
      <p>
        In this section, we review the complexity metrics that have
been proposed by various authors. Some of these measures directly provide the
recommended age for a reader, such as the García López [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] measure; others offer indexes that are more
difficult to interpret, such as the lexical complexity of Anula [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the sentence
complexity index or the dependency tree depth of Saggion [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], among others.
      </p>
      <p>
        Punctuation Marks. This measure was proposed by Saggion [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. The average number
of punctuation marks is used as one of the complexity indicators of the text.
      </p>
      <p>
        Sentence Complexity. The sentence complexity index was proposed by Anula [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
It measures the number of words per sentence, thus obtaining an index of sentence
length, and also the number of complex sentences per sentence, thus obtaining an index
of complex sentences. Complex sentences include, for instance, those with compound verbs.
      </p>
      <p>
        Automated Readability Index. Senter and Smith [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] proposed one of the most widely
used indexes due to its ease of calculation. It measures the difficulty of a text from the
average number of characters (letters and numbers) per word and the average number
of words per sentence.
      </p>
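      <p>As an illustration, the Automated Readability Index can be sketched in a few lines of Python using its standard published coefficients. The regular-expression tokenizer and sentence splitter below are simplifying assumptions made for this example; they are not the Freeling or NLTK processing used in our system.</p>
      <preformat>
```python
import re


def automated_readability_index(text: str) -> float:
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    # Naive sentence split on '.', '!' and '?' (an assumption for this sketch).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters and digits; punctuation is ignored.
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    characters = sum(len(w) for w in words)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


score = automated_readability_index("The cat sat on the mat. It was warm.")
print(round(score, 2))
```
      </preformat>
      <p>Short, simple sentences give low (even negative) values, while long words and long sentences push the index up.</p>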
      <p>
        Readability. This is a formula for calculating the readability of a text. It
provides an index between 0 and 100 and was developed by Muñoz [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This measure
is based on the number of words, the average number of letters per word and
their variance.
      </p>
      <p>
        Dependency Tree Height. This measure was also proposed by Saggion [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. It is a
very useful metric for capturing syntactic complexity: long sentences can be
syntactically complex or contain a large number of modifiers (adjectives, adverbs or adverbial
phrases). The latter do not increase syntactic complexity and do not lead to very deep
trees, while the former have a strong tendency to produce deep trees.
      </p>
      <p>
        Gunning Fog Score (FOG). This measure was developed in 1952 by Robert Gunning
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This index is a readability test for English and Polish writing. It estimates
the years of formal education a person needs to understand the text on the first reading.
      </p>
      <p>
        Flesch. The most popular English formula for calculating readability was proposed by
Rudolf Flesch [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. It measures the difficulty of a text from the average number of syllables
per word and the average number of words per sentence.
      </p>
      <p>
        Flesch-Kincaid. Rudolf Flesch is the co-author of this formula along with John P.
Kincaid [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This index improves upon the Flesch Reading Ease formula
and estimates the number of years of education in the American school system
necessary to comprehend a given text.
      </p>
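      <p>The two Flesch-based English formulas can be sketched as follows, using their standard published coefficients. The functions take pre-computed counts as arguments, because syllable counting is language dependent and is left out of this sketch; the counts in the usage lines are hypothetical.</p>
      <preformat>
```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade(100, 5, 150)
print(round(ease, 3), round(grade, 2))
```
      </preformat>
      <p>Both formulas rely on the same two ratios (words per sentence and syllables per word), which is also the basis of the Spanish, French and Italian adaptations described below.</p>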
      <p>
        SMOG. G. Harry McLaughlin introduced the SMOG readability formula in a 1969
article [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This formula estimates the years of education a person needs to understand
a text from the number of sentences and the number of words with three or more syllables.
      </p>
      <p>
        Lexical Complexity. This measure was proposed by Anula [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to
measure the lexical complexity of a text, determined on the basis of the frequency of use of its words and
its lexical density. The greater the lexical density (the greater the
number of different words per text), the greater the difficulty of comprehension is considered to be.
      </p>
      <p>
        Spaulding Readability. The Spaulding readability, commonly known as the SSR
index, was proposed by Spaulding [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. It focuses on measuring vocabulary and sentence
structure to predict the relative readability of a text.
      </p>
      <p>
        Fernández-Huerta Readability. Blanco Pérez [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Gutiérrez Couto [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] propose
this measure of complexity as an adaptation to Spanish of the Flesch readability test
([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). It is based on the fact that in Spanish the words have more syllables on average
and the sentences are also longer. It measures the average number of syllables per word
and the average number of words per sentence in the text.
      </p>
      <p>
        Flesch-Szigriszt Readability (IFSZ). The works of Granada Barrio-Cantalejo [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and
Ramírez-Puerta [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] propose the Flesch-Szigriszt readability index as a modification
of the Flesch [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] formula adapted to Spanish. The IFSZ readability index is considered
a reference for the Spanish language. It measures the number of syllables per word and
the number of words per sentence in the text.
      </p>
      <p>
        Gutierrez Readability. It was created for Spanish by Rodríguez [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and consists of a
mathematical formula, generated by multiple regression methods, which includes
certain linguistic characteristics of the material whose difficulty is to be evaluated. It
focuses on measuring the average number of letters per word and the average number of
words per sentence.
      </p>
      <p>
        Minimum Age of Readability. In the work of García López [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we can find another
formula to estimate the age a reader needs to understand a text. It is, again,
an adaptation to Spanish of Flesch’s original formula ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) for English. It measures the
average number of syllables per word and the average number of words per sentence to
obtain the minimum age necessary to understand a text.
      </p>
      <p>
        SOL. Contreras [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposes the SOL metric as an adaptation to Spanish and French of
the SMOG formula proposed by McLaughlin [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. It measures the readability of a text
by the grade level, which is the number of years of school required to understand the
text.
      </p>
      <p>
        Crawford. This measure was proposed by Alan N. Crawford in 1989 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. It is used
to calculate the years of school required to understand a text. It measures the number of
sentences per hundred words and the number of syllables per hundred words.
      </p>
      <p>
        Kandel-Moles. Kandel and Moles [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose this measure of complexity as an
adaptation to French of the Flesch readability test ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). It is based on the fact that in
French the words have more syllables on average and the sentences are also longer.
It measures the average number of syllables per word and the average number of words
per sentence in the text.
      </p>
      <p>
        Dale Chall. This formula was inspired by Rudolf Flesch’s Flesch-Kincaid readability
and was created by Edgar Dale and Jeanne Chall [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It measures the complexity of a
text, determined by the difficulty of the words and the average number of words per
sentence.
      </p>
      <p>
        Flesch-Vacca. In 1972, Roberto Vacca and Valerio Franchina [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed this measure
of complexity as an adaptation to Italian of the Flesch readability test ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). It measures
the average number of syllables per word and the average number of words per sentence
in the text.
      </p>
      <p>
        Gulpease. The Gulpease index is a readability index for Italian text. It was defined as
part of the research of the GULP (Gruppo Universitario Linguistico Pedagogico) at
the Seminar of Educational Sciences of the University of Rome "La Sapienza" [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
Compared to other indexes, it has the advantage of using the length of words in letters
instead of in syllables, which simplifies the automatic calculation. It provides an index
between 0 and 100, with 100 indicating the highest readability and 0 indicating the
lowest readability.
      </p>
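      <p>Because it only needs letter, word and sentence counts, the Gulpease index is straightforward to compute. The following sketch uses the commonly cited form of the formula; the counts in the usage line are hypothetical and the function name is chosen for this example only.</p>
      <preformat>
```python
def gulpease(letters: int, words: int, sentences: int) -> float:
    """Gulpease index: 89 + (300 * sentences - 10 * letters) / words."""
    if words == 0:
        raise ValueError("the text must contain at least one word")
    return 89 + (300 * sentences - 10 * letters) / words


# Hypothetical counts: 500 letters, 100 words, 5 sentences.
index = gulpease(letters=500, words=100, sentences=5)
print(index)
```
      </preformat>
      <p>Longer words lower the index and shorter sentences raise it, which matches the interpretation of higher values as more readable text.</p>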
      <p>
        Pisarek. The most popular Polish formula for calculating readability was proposed in
the 1960s by Walery Pisarek [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], based on research by Rudolf Flesch [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and Josef
Mistrik [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. It takes into account only two characteristics of the text: the average length of
a sentence and the percentage of "potentially difficult" words (longer than three
syllables).
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Method</title>
      <p>In this section, we present the different approaches that we have applied in our
participation in CLEF PAN 2018 Author Identification Task.</p>
      <p>Our hypothesis is that, since complexity metrics capture different aspects of text
complexity, they should be valid as characteristics in a model that represents each
document in this author identification process. Additionally, we will compare these results
with those obtained with other models of the text but not associated with complexity,
such as the Part-Of-Speech Tagger vectors with TF and TF.IDF representations.</p>
      <p>In short, we have tried two main approaches, both using supervised learning
algorithms. The two approaches are as follows:</p>
      <list list-type="bullet">
        <list-item>
          <p>Vectors of complexity measures of the fanfics.</p>
        </list-item>
        <list-item>
          <p>Vectors of n-grams of Part-Of-Speech tags of the fanfics.</p>
        </list-item>
      </list>
      <p>
        For both cases, texts have been processed using the Freeling<sup>1</sup> toolkit [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to
tokenize words and punctuation, along with sentence splitting. Only in the case of Polish
and Italian have we implemented our own tokenizer, using regular expressions and the
NLTK<sup>2</sup> library [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. All our scripts have been coded in Python. No special treatment of
speech expressions has been performed.
      </p>
      <p>
        The classification was performed by applying the Support Vector Classification
provided by the SciKit-Learn<sup>3</sup> library [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The multiclass support is handled according to
a one-vs-one scheme and the cost value was fixed to 1 to avoid discrimination among
classes. The kernel function used was Radial Basis Function (RBF).
      </p>
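      <p>A classifier configured as described above could look like the following sketch. It assumes scikit-learn is installed and uses synthetic data as a stand-in for the document feature vectors; it is not the script we submitted. In scikit-learn, SVC always trains multiclass problems with a one-vs-one scheme, and the decision_function_shape parameter only controls the shape of the returned scores.</p>
      <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for document feature vectors (not the PAN data).
X, y = make_classification(
    n_samples=90, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# RBF kernel, cost C fixed to 1, one-vs-one multiclass decision function.
clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovo")
clf.fit(X, y)

predictions = clf.predict(X[:5])
# With 3 classes, one-vs-one yields 3 * (3 - 1) / 2 = 3 pairwise scores per sample.
pairwise_scores = clf.decision_function(X[:5])
print(predictions.shape, pairwise_scores.shape)
```
      </preformat>
      <p>In the real task, each candidate author is one class and each known fanfic is one training sample.</p>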
      <p>Subsections 3.1 and 3.2 describe the systems developed for the two approaches.</p>
      <sec id="sec-2-1">
        <title>3.1 Complexity measures of the fanfics</title>
        <p>In this approach, we obtain a vector of features for each text. This feature vector is
made up of the values obtained from the complexity measures. We have to distinguish
between the different languages, since the complexity measures depend on them, as can be
seen in Table 1.</p>
        <p>Once we have the feature vectors, an SVM classifier is trained
on known fanfics and applied to predict the most likely authors of
unknown fanfics.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3.2 Part-Of-Speech Tagger of the fanfics</title>
        <p>In this approach, we obtain the POS tags for each fanfic text and apply TF or TF.IDF weighting
to perform the automatic classification with the SVM algorithm. For this, we train with fanfics that
have known candidates and predict the candidates of the unknown fanfics to obtain final
measurements of the classifier’s performance.</p>
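        <p>The n-gram features can be illustrated with a minimal sketch in plain Python. It assumes that TF is the relative frequency of each n-gram within a document and that the tag sequence has already been produced by a tagger; the tag names and the function name are hypothetical.</p>
        <preformat>
```python
from collections import Counter
from typing import Dict, List, Tuple


def pos_ngram_tf(
    tags: List[str], min_n: int = 1, max_n: int = 4
) -> Dict[Tuple[str, ...], float]:
    """Relative frequency (TF) of every POS n-gram with min_n to max_n tags."""
    counts: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    total = sum(counts.values())
    return {ngram: c / total for ngram, c in counts.items()} if total else {}


# Hypothetical tag sequence; a real run would take tags from Freeling or TreeTagger.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tf = pos_ngram_tf(tags, max_n=2)
print(tf[("DET", "NOUN")])
```
        </preformat>
        <p>Each distinct n-gram becomes one dimension of the vector space, so a document is represented without any reference to its vocabulary or topic.</p>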
        <p>For English, Spanish and French we have used the Freeling tool to process texts and
Python’s SciKit-Learn library for machine learning.</p>
        <p>For Italian and Polish we used the NLTK Python library, and for Polish we used
TreeTagger to get the POS tags for each text. We have also used Python libraries for
machine learning.</p>
        <p>1 http://nlp.lsi.upc.edu/freeling/</p>
        <p>2 http://www.nltk.org</p>
        <p>3 http://scikit-learn.org</p>
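        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Complexity measures computed for each language.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Measure</th><th>English</th><th>Spanish</th><th>French</th><th>Italian</th><th>Polish</th></tr>
            </thead>
            <tbody>
              <tr><td>Punctuation Marks</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td></tr>
              <tr><td>Sentence Complexity Index</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td></tr>
              <tr><td>Automated Readability Index</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td></tr>
              <tr><td>Readability</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td></tr>
              <tr><td>Dependency Tree Height</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>FOG</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr>
              <tr><td>Flesch</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr>
              <tr><td>Flesch-Kincaid</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr>
              <tr><td>SMOG</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Lexical Complexity</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Spaulding Readability</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Fernández-Huerta Readability</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Flesch-Szigriszt Readability</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Gutierrez Readability</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Minimum Age of Readability</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>SOL</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td></tr>
              <tr><td>Crawford</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td></tr>
              <tr><td>Kandel-Moles</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td></tr>
              <tr><td>Dale Chall</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td></tr>
              <tr><td>Flesch-Vacca</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td></tr>
              <tr><td>Gulpease</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td></tr>
              <tr><td>Pisarek</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As indicated above, on the one hand we have evaluated the contribution of these
complexity metrics to an author identification task through machine learning. On the other
hand, we have evaluated the contribution of Part-Of-Speech tags with TF and TF.IDF
representation to this task.</p>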
        <p>The experiments we have conducted are parametrized by options such as how the
normalization of the vectors is computed (L2 over samples or over features), the weighting
scheme used (TF or TF.IDF) and the maximum length of POS n-grams considered (from
2 up to 4). The runs we have tried on the training set are shown in Table 2.</p>
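        <p>The difference between the two L2 normalization options can be shown with a small sketch in plain Python. The function names and the toy matrix are chosen for this example only; rows stand for documents (samples) and columns for features.</p>
        <preformat>
```python
import math
from typing import List

Matrix = List[List[float]]


def l2_normalize_rows(matrix: Matrix) -> Matrix:
    """Scale each sample (row) to unit Euclidean length."""
    result = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row))
        result.append([v / norm if norm else 0.0 for v in row])
    return result


def l2_normalize_columns(matrix: Matrix) -> Matrix:
    """Scale each feature (column) to unit Euclidean length."""
    transposed = [list(col) for col in zip(*matrix)]
    normalized = l2_normalize_rows(transposed)
    return [list(row) for row in zip(*normalized)]


X = [[3.0, 4.0], [0.0, 2.0]]
per_sample = l2_normalize_rows(X)
per_feature = l2_normalize_columns(X)
print(per_sample)
print(per_feature)
```
        </preformat>
        <p>Per-sample normalization removes the effect of document length, whereas per-feature normalization puts features with very different ranges, such as different readability indexes, on a comparable scale.</p>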
        <p>Of all the experiments carried out on the training set (whose results are shown here),
we have selected those results that are most representative of our work. Our team
submitted both systems, though only the script using complexity measures as features
produced correct output with the test set.</p>
        <p>The results obtained in the experiments with complexity measures and per-sample L2 normalization can be seen
in Table 3 and, as can be observed, they are quite low.</p>
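        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Runs tried on the training set.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Features</th><th>L2-normalization</th><th>N-gram sizes</th><th>Weighting scheme</th></tr>
            </thead>
            <tbody>
              <tr><td>Complexity measures</td><td>per sample</td><td /><td /></tr>
              <tr><td>Complexity measures</td><td>per feature</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per sample</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per feature</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per sample</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per feature</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per sample</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per feature</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per sample</td><td /><td /></tr>
              <tr><td>POS tags</td><td>per feature</td><td /><td /></tr>
            </tbody>
          </table>
        </table-wrap>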
        <p>The best results on the training set were obtained with POS-TF (1-, 2-, 3- and 4-grams) with
L2-normalization per sample, as shown in Table 4.</p>
        <p>The tables above show the best results obtained with the two approaches studied in
this work. We can see that there are significant differences in the performance of
these models depending on the problem and on the language. Features based on POS tags
behave better than complexity measures. Although F1 performance remains below 0.5
in most cases, there are some impressive results for Spanish in Problem00002, which
reaches a 0.838 score.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4.1 Official results</title>
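        <p>
          Only the script where complexity measures were used as features for the classifier
produced valid output on the test set in the TIRA [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] server (the platform used to run
experiments). As seen above for experiments with training data, this approach results in
poor performance. The results obtained gave an overall macro-F1 score of 0.149.
Table 5 shows the official results obtained.
        </p>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Problem</th><th>Language</th></tr>
            </thead>
            <tbody>
              <tr><td>Problem 00001</td><td>English</td></tr>
              <tr><td>Problem 00002</td><td>English</td></tr>
              <tr><td>Problem 00003</td><td>French</td></tr>
              <tr><td>Problem 00004</td><td>French</td></tr>
              <tr><td>Problem 00005</td><td>Italian</td></tr>
              <tr><td>Problem 00006</td><td>Italian</td></tr>
              <tr><td>Problem 00007</td><td>Polish</td></tr>
              <tr><td>Problem 00008</td><td>Polish</td></tr>
              <tr><td>Problem 00009</td><td>Spanish</td></tr>
              <tr><td>Problem 00010</td><td>Spanish</td></tr>
            </tbody>
          </table>
        </table-wrap>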
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Official results.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Problem</th><th>Macro F1</th></tr>
            </thead>
            <tbody>
              <tr><td>Problem 00001</td><td /></tr>
              <tr><td>Problem 00002</td><td /></tr>
              <tr><td>Problem 00003</td><td /></tr>
              <tr><td>Problem 00004</td><td /></tr>
              <tr><td>Problem 00005</td><td /></tr>
              <tr><td>Problem 00006</td><td /></tr>
              <tr><td>Problem 00007</td><td /></tr>
              <tr><td>Problem 00008</td><td /></tr>
              <tr><td>Problem 00009</td><td /></tr>
              <tr><td>Problem 00010</td><td /></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We have computed several complexity measures for different languages and tested their
suitability as features for authorship identification. Also, POS-tag n-grams have been
explored as features for this task. From our experiments and the results obtained we
can conclude that the complexity metrics considered are not very helpful to identify
the author of a text. This could be explained by the low number of aspects captured by
these features, which basically rely on sentence length or the number of syllables in a
word. Also, the merging of all these smaller characteristics (rare words, punctuation marks,
sentence length...) into a final index of readability or complexity may have nothing to
do with author style or characterization.</p>
        <p>Our second approach, the use of POS tags, seems a better approach to the problem,
although results range from very poor (in the case of Polish, Problem00007) to very good
(Spanish, Problem00002). These results need further analysis.</p>
        <p>
          As future work, we plan to combine both approaches and to use the base metrics behind
complexity indexes rather than the final values produced by complexity-related
formulas. Language modeling approaches appear to be a natural way of representing authors,
but these models need far more data to be trained than just 20 texts per class. In this
case, a model based on Bayesian networks [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] could be used, as these models do not
need large training data sets.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6 Acknowledgments</title>
      <p>This work has been partially funded by the Spanish Government under the REDES
Project (TIN2015-65136-C2-1-R).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anula</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Lecturas adaptadas a la enseñanza del español como L2: variables lingüísticas para la determinación del nivel de legibilidad. La evaluación en el aprendizaje y la enseñanza del español como LE/L2</article-title>
          ,
          <fpage>162</fpage>
          -
          <lpage>170</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>In: Proceedings of the ACL 2004 on Interactive poster and demonstration sessions</source>
          . p.
          <fpage>31</fpage>
          . Association for Computational Linguistics (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blanco Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutiérrez Couto</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Legibilidad de las páginas web sobre salud dirigidas a pacientes y lectores de la población general</article-title>
          .
          <source>Revista española de salud pública 76(4)</source>
          ,
          <fpage>321</fpage>
          -
          <lpage>331</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Broda</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niton</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruszczynski</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogrodniczuk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Measuring readability of Polish texts: Baseline experiments</article-title>
          .
          <source>In: LREC</source>
          . pp.
          <fpage>573</fpage>
          -
          <lpage>580</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Contreras</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Alonso</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Echenique</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daye-Contreras</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>The SOL formulas for converting SMOG readability scores between health education materials written in Spanish, English, and French</article-title>
          .
          <source>Journal of health communication 4(1)</source>
          ,
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Crawford</surname>
            ,
            <given-names>A.N.:</given-names>
          </string-name>
          <article-title>A Spanish language Fry-type readability procedure: Elementary level</article-title>
          .
          <source>Bilingual Education Paper Series</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>8</issue>
          . (
          <year>1984</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dale</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chall</surname>
            ,
            <given-names>J.S.:</given-names>
          </string-name>
          <article-title>A formula for predicting readability: Instructions</article-title>
          . Educational research bulletin pp.
          <fpage>37</fpage>
          -
          <lpage>54</lpage>
          (
          <year>1948</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Flesch</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>A new readability yardstick</article-title>
          .
          <source>Journal of applied psychology 32(3)</source>
          ,
          <volume>221</volume>
          (
          <year>1948</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Franchina</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vacca</surname>
          </string-name>
          , R.:
          <article-title>Adaptation of Flesch readability index on a bilingual text written by the same author both in Italian and English languages</article-title>
          .
          <source>Linguaggi</source>
          <volume>3</volume>
          ,
          <fpage>47</fpage>
          -
          <lpage>49</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>García López</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Legibilidad de los folletos informativos</article-title>
          .
          <source>Pharmaceutical Care España</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ),
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>de Granada Barrio-Cantalejo</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simón-Lorda</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melguizo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Escalona</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marijuán</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hernándo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , et al.:
          <article-title>Validación de la escala INFLESZ para evaluar la legibilidad de los textos dirigidos a pacientes</article-title>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kandel</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moles</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Application de l’indice de Flesch à la langue française</article-title>
          .
          <source>Cahiers Etudes de Radio-Télévision</source>
          <volume>19</volume>
          ,
          <fpage>253</fpage>
          -
          <lpage>274</lpage>
          (
          <year>1958</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Specht</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</article-title>
          . In:
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Working Notes Papers of the CLEF 2018 Evaluation Labs</source>
          . CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Lucisano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piemontese</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Gulpease: una formula per la predizione della difficoltà dei testi in lingua italiana</article-title>
          .
          <source>Scuola e città</source>
          <volume>3</volume>
          (
          <issue>31</issue>
          ),
          <fpage>110</fpage>
          -
          <lpage>124</lpage>
          (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>McLaughlin</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          :
          <article-title>SMOG grading - a new readability formula</article-title>
          .
          <source>Journal of Reading</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          )
          ,
          <fpage>639</fpage>
          -
          <lpage>646</lpage>
          (
          <year>1969</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Mistrík</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Meranie zrozumiteľnosti prehovoru</article-title>
          .
          <source>Slovenská reč</source>
          <volume>33</volume>
          (
          <year>1968</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Legibilidad y variabilidad de los textos</article-title>
          .
          <source>Boletín de Investigación Educacional, Pontificia Universidad Católica de Chile</source>
          <volume>21</volume>
          (
          <issue>2</issue>
          )
          ,
          <fpage>13</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Padró</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>FreeLing 3.0: Towards wider multilinguality</article-title>
          . In:
          <source>LREC2012</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>From Bayesian networks to causal networks</article-title>
          . In:
          <source>Mathematical models for handling partial knowledge in artificial intelligence</source>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>182</lpage>
          . Springer (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</article-title>
          . In:
          <string-name>
            <surname>Kanoulas</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lupu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clough</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanderson</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanbury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (eds.)
          <article-title>Information Access Evaluation meets Multilinguality, Multimodality, and Visualization</article-title>
          .
          <source>5th International Conference of the CLEF Initiative (CLEF 14)</source>
          . pp.
          <fpage>268</fpage>
          -
          <lpage>299</lpage>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ramírez-Puerta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández-Fernández</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frías-Pareja</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuste-Ossorio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narbona-Galdó</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peñas-Maldonado</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Análisis de legibilidad de consentimientos informados en cuidados intensivos</article-title>
          .
          <source>Medicina Intensiva</source>
          <volume>37</volume>
          (
          <issue>8</issue>
          ),
          <fpage>503</fpage>
          -
          <lpage>509</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Determinación de la comprensibilidad de materiales de lectura por medio de variables lingüísticas</article-title>
          .
          <source>Lectura y vida</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>29</fpage>
          -
          <lpage>32</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Saggion</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štajner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bott</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mille</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rello</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drndarevic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Making it Simplext: Implementation and evaluation of a text simplification system for Spanish</article-title>
          .
          <source>ACM Transactions on Accessible Computing (TACCESS)</source>
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          14
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Senter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>Automated readability index</article-title>
          . Tech. rep.,
          <source>Cincinnati Univ., OH</source>
          (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Spaulding</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A Spanish readability formula</article-title>
          .
          <source>The Modern Language Journal</source>
          <volume>40</volume>
          (
          <issue>8</issue>
          ),
          <fpage>433</fpage>
          -
          <lpage>441</lpage>
          (
          <year>1956</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tschuggnall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kestemont</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation</article-title>
          . In:
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trabelsi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murtagh</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanjuan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction.
          <source>9th International Conference of the CLEF Initiative (CLEF 18)</source>
          . Springer, Berlin Heidelberg New York (Sep
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>