Anomaly Searching in Text Sequences

Abdulwahed Almarimi, Gabriela Andrejková, Asmaa Salem
abdoalmarimi@gmail.com, gabriela.andrejkova@upjs.sk, asmamostafa.salem@gmail.com
Pavol Jozef Šafárik University, Faculty of Science, Košice, Slovakia

Abstract

The analysis of a long text whose author is unknown, or for which it is not clear whether it was written by a single author, is still an interesting problem; it can be approached with methods of data analysis, data mining and structural analysis. This paper describes a system of modified Self-Organizing Maps working on probabilistic sequences built from a text. The sequences were built on letters and on words as n-grams, 1 ≤ n ≤ 4. The system is trained on the input sequences and, after the training, it determines text parts with anomalies (parts with different characteristics) using a cumulative error and a complex analysis. On the tested long texts the system was successful: it uncovered the composition of texts.

1 Introduction

Each text is written in some genre and in some language, and it uses some grammar. It presents a sequence of letters, a sequence of words, sentences and sections. We look for anomalies in the text, that is, for text parts whose characteristics, based on n-grams, differ from those of the full text or of the remaining parts of the text. For example, anomalies should show whether the text was composed from two or more parts written by different authors, or whether somebody manipulated the text. The problem is related to authorship attribution [Neme2015], [Stamatatos2010], plagiarism detection [Eissen2006], [Stamatatos2006] and authorship verification [Hassan2012]; in those problems, however, there exist groups of comparable authors and comparable texts, so the results of an analysis can be compared across texts or across authors. In our problem no author is known and we analyze each text on its own [Almarimi2015].

In the solution of the problem we use Self-Organizing Maps (SOM), a neural network model [Kohonen2007]. A very good description of Self-Organizing Map extensions for temporal structures is given in [Hammer2005]; some of the extensions are usable for sequences. SOM models were applied to time series in [Barreto2009], which inspired us to use them in text analysis. In the text analysis we used English texts recommended by the benchmark [CorE2011] and Arabic texts from [CorA2011].

In this contribution we developed a new system for anomaly detection based on the SOM neural network model and applied it to long Arabic and English texts. The system uncovers anomalies in texts consisting of two texts written by different authors.

The paper has the following structure. The second section gives the theoretical background for the construction of the learning sequences (based on symbols and on words). In the third section we describe our newly developed system for anomaly detection. The fourth section contains an evaluation and an illustration of the results on English and Arabic texts, including information about the used texts. In the conclusion we give a summary of our work and a plan of further work in the area.

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: E. Vatai (ed.): Proceedings of the 11th Joint Conference on Mathematics and Computer Science, Eger, Hungary, 20th – 22nd of May, 2016, published at http://ceur-ws.org
2 Sequences of Symbols, Sequences of Words

2.1 Theoretical Background for Sequence Construction

The method applied to the analysis of word or letter sequences is a probabilistic model. The working sequences are represented by the probabilities of their occurrences. It would be possible to use individual relative frequencies of words or letters, but some word combinations have a higher probability while others are used only once. Therefore we use the conditional probability of a word given the previous words.

Notation:

• Γ – a finite alphabet of symbols; |Γ| is the number of symbols in Γ; in our texts, Γ_A will be the Arabic and Γ_E the English alphabet;
• V – a finite vocabulary of the words in a given text (over the alphabet Γ), presented in alphabetical order; |V| – the number of different words in the vocabulary V;
• d_1^N = ⟨d_1, ..., d_N⟩, d_i ∈ V – the text as a finite sequence of words; N – the number of words in the text d_1^N;
• s_1^M = ⟨s_1 s_2 ... s_M⟩, s_i ∈ Γ – the text as a finite sequence of symbols; M – the number of symbols in the text s_1^M (including spaces);
• the probability of the complete sequence of words d_1^N can be represented by the joint probability P_d(d_1, d_2, ..., d_N);
• the probability of the complete sequence of symbols s_1^M can be represented by the joint probability P_s(s_1, s_2, ..., s_M).

According to [Jurafsky2000] the probabilities P_d and P_s can be decomposed using conditional probabilities in the following way:

P_d(d_1^N) = P(d_1) P(d_2|d_1) P(d_3|d_1^2) ... P(d_N|d_1^{N-1}) = ∏_{i=1}^{N} P(d_i|d_1^{i-1}), where d_1^{i-1} = ⟨d_1 ... d_{i-1}⟩,   (1)

P_s(s_1^M) = P(s_1) P(s_2|s_1) P(s_3|s_1^2) ... P(s_M|s_1^{M-1}) = ∏_{i=1}^{M} P(s_i|s_1^{i-1}), where s_1^{i-1} = ⟨s_1 ... s_{i-1}⟩.   (2)

The problem of computing these probabilities is solved with a simplification: an approximation of the probability of a word (symbol) given all the previous words (symbols). The probability depends on all preceding words, but we use only a small number of them, for example 2, 3 or 4; that is, we work with 2-, 3- and 4-grams. The n-gram model approximates the probability of a word d_i, 1 ≤ i ≤ N (symbol s_i, 1 ≤ i ≤ M), conditioned on all previous words, P_d(d_i|d_1^{i-1}) (symbols P_s(s_i|s_1^{i-1})), by the conditional probability given only the n − 1 preceding words (symbols):

P_d(d_i|d_1^{i-1}) ≈ P_d(d_i|d_{i-n+1}^{i-1}), 1 ≤ i ≤ N,   P_s(s_i|s_1^{i-1}) ≈ P_s(s_i|s_{i-n+1}^{i-1}), 1 ≤ i ≤ M.   (3)

We have a formula for the general case of n-gram parameter estimation:

P_d(d_i|d_{i-n+1}^{i-1}) ≈ C(d_{i-n+1}^{i-1} d_i) / C(d_{i-n+1}^{i-1}), 1 ≤ i ≤ N,   P_s(s_i|s_{i-n+1}^{i-1}) ≈ C(s_{i-n+1}^{i-1} s_i) / C(s_{i-n+1}^{i-1}), 1 ≤ i ≤ M,   (4)

where C(d_{i-n+1}^{i-1} d_i) is the count of the n-gram d_{i-n+1}^{i-1} d_i and C(d_{i-n+1}^{i-1}) is the count of the (n − 1)-gram d_{i-n+1}^{i-1}.
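The estimate in (4) is a simple ratio of n-gram counts. The following sketch illustrates it; it is not the authors' implementation, and the whitespace tokenization, the function name and the restriction to positions from the first full n-gram onwards are our own assumptions.

```python
from collections import Counter

def ngram_cond_probs(tokens, n):
    """Estimate P(t_i | t_{i-n+1} ... t_{i-1}) for i = n, ..., len(tokens), following (4):
    the count of the n-gram divided by the count of its (n-1)-gram prefix (n >= 2)."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    prefixes = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    probs = []
    for i in range(n - 1, len(tokens)):
        gram = tuple(tokens[i - n + 1:i + 1])
        probs.append(ngrams[gram] / prefixes[gram[:-1]])
    return probs

# Word 2-grams and symbol 3-grams of a toy text (spaces count as symbols).
text = "the cat sat on the mat and the cat slept"
print(ngram_cond_probs(text.split(), 2))
print(ngram_cond_probs(list(text), 3))
```

Section 2.2 turns these counts into the four probability sequences that are fed to the system.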
2.2 Sequences of symbols/words and n-grams of symbols/words

The text is a sequence of symbols over the alphabet Γ. We map the given text D to the following sequences of occurrence probabilities computed from D.

1. The sequence S_s of symbol probabilities:
   • C(s_i) – the number of occurrences of s_i in the text D;
   • the sequence S_s is prepared by the formula
     f_s(s_i) = C(s_i) / M, 1 ≤ i ≤ M.   (5)

2. The sequence S_sn of probabilities of n-grams of symbols:
   • the substring s_i s_{i+1} s_{i+2} ... s_{i+n-1}, i = 1, ..., M − n + 1, n = 1, 2, ..., is called an n-gram of symbols; n is the length of the n-gram;
   • C_n(^n g) – the number of occurrences of the n-gram ^n g in the text D;
   • C_{n-1}(^{n-1} g) – the number of occurrences of the (n − 1)-gram ^{n-1} g in the text D;
   • the sequence S_sn contains the frequencies of ^n g in the text D:
     f_sn(^n g) = C_n(^n g) / C_{n-1}(^{n-1} g).   (6)

3. The sequence S_w of word probabilities:
   • C(d_i) – the number of occurrences of d_i in the text D;
   • the sequence S_w is prepared using the following formula:
     f_w(d_i) = C(d_i) / N, 1 ≤ i ≤ N.   (7)

4. The sequence S_wn of probabilities of n-grams of words:
   • the subsequence d_i d_{i+1} d_{i+2} ... d_{i+n-1}, i = 1, ..., N − n + 1, n = 1, 2, ..., is called an n-gram of words;
   • C_n^w(^n g) – the number of occurrences of the n-gram ^n g in the text D;
   • C_{n-1}^w(^{n-1} g) – the number of occurrences of the (n − 1)-gram ^{n-1} g in the text D;
   • the sequence S_wn is prepared by the formula
     f_wn(^n g) = C_n^w(^n g) / C_{n-1}^w(^{n-1} g).   (8)

3 System for Anomaly Detection

3.1 Self-Organizing Maps

The Self-Organizing Map belongs to the class of unsupervised, competitive learning algorithms [Kohonen2007]. This type of neural network is used to map an n-dimensional space to a lower-dimensional space, usually a two-dimensional one. The neurons are usually arranged in a two-dimensional lattice, frequently called a map. The mapping preserves topology and each neuron has its own n-dimensional weight vector attached to the input. If the input is represented by a sequence (for example, a time series), where the order of the values is important, then it is necessary to follow the order and not to change it.

The steps of the algorithm:

1. Initialization. The weight vector of each node (neuron) in the lattice is initialized to small random values from the interval ⟨0, 1⟩. The weight vectors have the same dimension as the input vectors.

2. Winner identification for an input vector. Calculate the distance of the input vector to the weight vector of each node. The node with the shortest distance is the winner. If there is more than one node with the same distance, the winning node is chosen randomly among the nodes with the shortest distance. The winning node is called the Best Matching Unit (BMU). Let i* be the index of the winning node.

3. Neighborhood calculation. The following equation is used:

   h(i*, i, t) = exp( −‖r_i(t) − r_{i*}(t)‖ / σ²(t) ),   (9)

   where σ(t) is the radius of the neighborhood function, t is the iteration step, and r_i(t) and r_{i*}(t) are the coordinates of the units i and i* in the output array.

4. Weight adaptation. Only the weights of the nodes within the neighborhood radius are adapted, using the equation

   w_new = w_old + η · h(i*, i, t) · (x − w_old),   (10)

   where w_new is the vector of new weights, w_old are the old weights, η ∈ (0, 1) is the learning rate and x is the actual input vector.

After the weights have been changed, the algorithm presents the next input vector from the remaining input vectors and continues with step 2, and so on until no input vector is left.

3.2 Description of the system structure

In its first layer the system has SOM_x neural networks, x ∈ {words, w2-grams, w3-grams, s3-grams}, which are trained on different sequences built from the text T. The shape of the model is very similar to the model developed in [Almarimi2016], but here different analysed sequences are used for the training and the evaluation. The training of each SOM_x is done on the sequences S_w, S_w2, S_w3 and S_s3, respectively; a schematic sketch of the training loop is given below.
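The training loop of steps 2–4 can be sketched as follows. This is only an illustrative sketch under our own assumptions (a square map, a fixed learning rate and neighborhood radius, the neighborhood applied to all nodes instead of an explicit radius cut-off, and ties at the BMU broken by taking the first node); it is not the authors' Java implementation.

```python
import numpy as np

def train_som(inputs, rows=10, cols=10, epochs=200, eta=0.1, sigma=2.0, seed=0):
    """Train a SOM on input vectors of shape (R, dim), presented in their original order."""
    rng = np.random.default_rng(seed)
    dim = inputs.shape[1]
    weights = rng.random((rows * cols, dim))         # step 1: random weights in <0, 1>
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(epochs):
        for x in inputs:                             # keep the order of the sequence
            dists = np.linalg.norm(weights - x, axis=1)
            winner = int(np.argmin(dists))           # step 2: BMU (first node on ties)
            # step 3, eq. (9): neighborhood function centred at the winner
            h = np.exp(-np.linalg.norm(coords - coords[winner], axis=1) / sigma**2)
            # step 4, eq. (10): move the weights towards x, scaled by h and eta
            weights += eta * h[:, None] * (x - weights)
    return weights
```

In the system of Figure 1 one such map is trained for each of the four input sequences S_w, S_w2, S_w3 and S_s3.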
Figure 1: System for anomaly detection. The layer of SOM neural networks is trained on different sequences built from the same text. The layer of Error blocks evaluates the quality of the training. The cumulative error and the SOM winner evaluation are based on the results of this quality evaluation.

3.3 Description of the system computation

After SOM_x has been trained, it is possible to evaluate how good the training was by evaluating the errors for all input vectors (all windows in the text). We use the quantization error Er_x defined by (11) as a measure of the proximity of an input vector x+ to the learned weight vector w_{i*} of the i*-th neuron (the winner for the input vector x+) in SOM_x:

Er_x(x+, w_{i*}) = ‖x+ − w_{i*}‖.   (11)

Using formula (11) it is possible to compute the vectors of quantization errors

{Er_x(x+(t), w_{i*}(t))}_{t=1}^{R},   (12)

where R is the number of training vectors and t is the order of the member in the input sequence.

For the anomaly detection we use thresholds developed in [Barreto2009]. Let α be a significance level (α = 0.01 or α = 0.05). We suppose that the percentage of normal values of the quantization error is 100 · (1 − α). Let N_α be the real number such that 100 · (1 − α) percent of the error values are less than or equal to N_α. Then

• lower limit: λ− = N_{1−α/2},
• upper limit: λ+ = N_{α/2}.

The important interval is ⟨λ−, λ+⟩; values outside of it can be detected as anomalies. The quantization vectors Er_x and the intervals ⟨λ−_{S_x}, λ+_{S_x}⟩ are computed in the panels Error_{S_x}, x ∈ {words, w2-grams, w3-grams, s3-grams}, and they are used in the two following evaluations:

• Cumulative error.

  CEr = α_1 · Er_w + α_2 · Er_w2 + α_3 · Er_w3 + α_4 · Er_s3,   (13)

  where the α_i, i = 1, 2, 3, 4, with ∑_{i=1}^{4} α_i = 1, are parameters for the contribution of each error to the cumulative error. The values of the parameters α_i should be chosen after an analysis of all errors. If the cumulative error exceeds the threshold h_up given by formula (14),

  h_up = α_1 · λ+_{S_w} + α_2 · λ+_{S_w2} + α_3 · λ+_{S_w3} + α_4 · λ+_{S_s3},   (14)

  then the text needs some further analysis.

• Evaluation of text parts. The text T is divided into r, r > 1, disjoint parts, T = T_1 T_2 ... T_r. For each text part T_k, 1 ≤ k ≤ r, the following evaluation is done: T_{−k} = T_1 ... T_{k−1} T_{k+1} ... T_r is used as the training text and T_k as the testing text. After training our system on T_{−k}, the testing text T_k is evaluated using the quantization vector (12) and the interval ⟨λ−, λ+⟩ built on the training data. The percentage of values x with x ∈ ⟨λ−, λ+⟩ expresses how similar the text T_k is to the training text. Evaluating the four sequences constructed from the texts gives a more complex view of the text.

A sketch of these computations is given below, after Figure 2.

Figure 2: The layer of errors. Evaluation of the cumulative error for the English text E777. The text E777 has 598 different words; the number of training vectors was 208 and the number of iterations was 200.
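The quantization errors (11)–(12), the limits λ−, λ+ and the threshold h_up of (14) can be computed as in the following sketch. The use of numpy.percentile for N_α, the function names and the helper for the weights α_i are our own choices, not the authors' code; the weight helper only anticipates the choice the paper describes later in Section 4.2.

```python
import numpy as np

def quantization_errors(inputs, weights):
    """Eq. (11)-(12): distance of every input vector to its closest (winning) weight vector."""
    dists = np.linalg.norm(inputs[:, None, :] - weights[None, :, :], axis=2)
    return dists.min(axis=1)

def error_limits(errors, alpha=0.05):
    """Limits of [Barreto2009]: lambda- = N_{1-alpha/2}, lambda+ = N_{alpha/2}."""
    return (np.percentile(errors, 100 * alpha / 2),        # lower limit lambda-
            np.percentile(errors, 100 * (1 - alpha / 2)))  # upper limit lambda+

def alpha_weights(error_vectors):
    """One possible choice of the weights alpha_i: proportional to the mean of each
    error vector and summing to 1 (the choice described later in Section 4.2)."""
    means = np.array([np.mean(e) for e in error_vectors])
    return means / means.sum()

def cumulative_error(error_vectors, alphas):
    """Eq. (13): weighted sum of the four error vectors (words, w2-, w3-, s3-grams)."""
    return sum(a * e for a, e in zip(alphas, error_vectors))

def upper_threshold(upper_limits, alphas):
    """Eq. (14): h_up as the weighted sum of the four upper limits lambda+."""
    return sum(a * u for a, u in zip(alphas, upper_limits))
```

Windows whose cumulative error exceeds h_up are the candidates for anomalous parts of the text.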
4 Evaluation

4.1 Data Preparation

In the text analysis we use English texts recommended by the benchmark [CorE2011] and Arabic texts from [CorA2011]. In Table 1 we give some information about 3 Arabic and 3 English texts; the texts A14 and E777 were constructed as a combination of two different texts (which is important for the illustration of our analysis). The position of the join in both texts shows up in the thresholds of the analysis.

Table 1: Statistics of 3 English (E5, E14, E777) and 3 Arabic (A1, A4, A14) texts: the numbers of words by length 1–10 and the maximal frequencies of 2- and 3-grams of words and symbols. The Arabic n-grams are given in Latin transliteration.

                         A1       A4       A14     E5       E14      E777
Total words              94197    31656    2168    93085    40721    1655
Total symbols            395065   135573   11409   417899   216677   10407
# different words        14110    10098    1108    12929    7492     598
# words of length  1     6        81       28      4018     1466     42
                   2     13217    4816     312     14663    5636     252*
                   3     23287    7795     529     19875    9041     296
                         24.62%   24.28%   24.40%  21.35%   24.91%   17.88%
                   4     22426*   6324*    456*    18520*   7022*    228
                         23.80%   19.97%   18.07%  19.89%   19.89%
                   5     15887    5417     342     10109    4032     152
                   6     9159     3364     249     6927     3397     136
                   7     5559     2004     144     6461     2425     141
                   8     1978     802      41      3717     1695     122
                   9     921      349      29      2538     1119     107
                  10     336      182      9       1831     762      91

Most frequent symbol 2-grams (frequency): A1: al (22128), A4: al (6545), A14: al (474), E5: he (11206), E14: he (5337), E777: th (218).
Most frequent symbol 3-grams: A1: alm (3027), A4: nal (3027), A14: alm (57), E5: the (6313), E14: the (3276), E777: the (127).
Most frequent word 2-grams: A1: allah alyah (747), A4: ala tra (115), A14: alyah wasallm (10), E5: "in the" (350), E14: "of the" (323), E777: "of the" (194).
Most frequent word 3-grams: A1: salla allah alyah (745), A4: ala tra ann (40), A14: salla allah alyah (10), E5: "it would be" (37), E14: "one of the" (168), E777: "the number of" (10).

4.2 Text E777, the cumulative error and text part evaluation

Cumulative error. Figure 2 shows the analysis of the cumulative error. The experiment was done with the parameters α_1, α_2, α_3, α_4 following the mean of each error, AveEr_j, j = 1, 2, 3, 4. They were prepared according to the formula

α_i = AveEr_i / ∑_{j=1}^{4} AveEr_j.   (15)

The used values were α_1 = 0.5614, α_2 = 0.0440, α_3 = 0.0087, α_4 = 0.3859. The influence of the word probabilities in the windows and of the 3-gram symbol probabilities in the same windows is higher than that of the other sequences. In the text E777 there exist windows whose cumulative error is higher than the threshold h_up defined by (14), so the text needs some further analysis; it should have some anomalous features.

Text part evaluation. The text E777 was divided into 4 parts of equal length. The evaluation of its text parts is given in Table 2. According to the similarity percentages in the columns "2-grams of words" and "3-grams of words", no part of the text is similar to the rest of the text E777 (the values are lower than 50 %). The result confirms that the text is combined from two different texts.

Table 2: The percentage of text-part similarity in the English text E777 for the four input sequences. Each value presents the similarity of the testing part to the training parts.

Training parts   Testing part   words       2-grams of words   3-grams of words   3-grams of symbols
{T2, T3, T4}     T1             65.9574 %   48.9362 %          8.5106 %           51.0638 %
{T1, T3, T4}     T2             62.0000 %   46.0000 %          49.0000 %          78.0000 %
{T1, T2, T4}     T3             76.4706 %   29.4118 %          29.4118 %          74.5098 %
{T1, T2, T3}     T4             83.3333 %   35.1852 %          3.7037 %           61.1111 %

4.3 Text A14, the cumulative error and text part evaluation

Cumulative error. The first part of Figure 3 shows the analysis of the cumulative error of the text A14. The experiment was done with the parameters α_1 = 0.3339, α_2 = 0.0314, α_3 = 0.0157, α_4 = 0.6189, computed according to formula (15). For all types of errors there exist error values above the thresholds, and the same holds for the threshold of the cumulative error. The text should have some anomalous features.

Figure 3: Evaluation of the cumulative error for the Arabic text A14.

Text part evaluation. The text A14 was divided into 4 parts of equal length. The evaluation of its text parts is given in Table 3 (a sketch of the procedure follows the table). The similarity percentages in the columns "2-grams of words" and "3-grams of words" are, except for the part T4, lower than 50 %, which means that the text A14 is not a consistent text. The result confirms that the text is combined from two different texts.

Table 3: The percentage of text-part similarity in the Arabic text A14 for the four input sequences.

Training parts   Testing part   words       2-grams of words   3-grams of words   3-grams of symbols
{T2, T3, T4}     T1             54.7170 %   43.3962 %          43.3962 %          45.2830 %
{T1, T3, T4}     T2             77.9661 %   52.5424 %          32.2034 %          79.6610 %
{T1, T2, T4}     T3             70.3704 %   18.5185 %          22.2222 %          75.9259 %
{T1, T2, T3}     T4             54.7170 %   71.4286 %          67.8571 %          78.5714 %
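The part-similarity percentages in Tables 2 and 3 follow the evaluation of text parts described in Section 3.3: train on T_{−k}, build ⟨λ−, λ+⟩ on the training errors, and count how many testing errors fall inside the interval. The sketch below is our own schematic reading of that procedure; it reuses the illustrative helpers train_som, quantization_errors and error_limits from the earlier sketches and assumes the window vectors for each part are already prepared, so it does not reproduce the authors' exact windowing.

```python
import numpy as np

def part_similarities(part_vectors, alpha=0.05):
    """part_vectors: list of r arrays, one per text part T_1..T_r, each of shape
    (number of windows in the part, dim). Returns one similarity percentage per part."""
    similarities = []
    for k, test in enumerate(part_vectors):
        train = np.vstack([p for j, p in enumerate(part_vectors) if j != k])   # T_{-k}
        weights = train_som(train)                            # train on the other parts
        low, up = error_limits(quantization_errors(train, weights), alpha)
        test_err = quantization_errors(test, weights)
        inside = np.mean((test_err >= low) & (test_err <= up))
        similarities.append(100.0 * inside)                   # share of errors in <lambda-, lambda+>
    return similarities
```

Running this once per input sequence (words, 2- and 3-grams of words, 3-grams of symbols) fills the four similarity columns of Tables 2 and 3.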
4.4 Results for 40 Arabic and 40 English texts

The main goal of the system is to find anomalies in a given long text. That means our system has to be trained on some parts of the text and tested on the other parts. The system was evaluated on 40 Arabic texts [CorA2011] and 40 English texts [CorE2011]. In Table 1 we show some information about some of the used texts.

The strategy of the evaluation was to split each text into four parts; three of them were used for training and the fourth for testing, which gives four possibilities for each text. The training and testing of each text was done for the different input sequences of the system: words, 2-grams of words, 3-grams of words and 3-grams of symbols. The results for each text were analyzed according to the criteria of [Almarimi2016]:

• Good text – if the values of the similarity percentage were higher than 75.000 %.
• Critical text – if the values of the similarity percentage were lower than 75.000 %.

A critical text means that the tested parts of the text give values below the critical threshold. Each text was evaluated 16 times, four times for each of the following methods: words, 2-grams of words, 3-grams of words and 3-grams of symbols. The results of the methods were combined into the final classification of each text (good or critical). In Table 4 we show the results for the 40 Arabic and 40 English texts.

Table 4: Results for the 40 Arabic and 40 English texts according to the used evaluation method.

Arabic texts
  Good (62.5 %):     A1, A2, A3, A4, A5, A6, A8, A9, A10, A12, A13, A14, A16, A17, A18, A19, A20, A22, A23, A24, A26, A29, A33, A37, A40
  Critical (37.5 %): A7, A11, A15, A21, A25, A27, A28, A30, A31, A32, A34, A35, A36, A38, A39

English texts
  Good (80 %):       E1, E2, E3, E4, E5, E6, E7, E8, E9, E10, E11, E12, E13, E14, E15, E19, E20, E21, E23, E24, E26, E29, E30, E32, E33, E34, E35, E36, E37, E38, E39, E40
  Critical (20 %):   E16, E17, E18, E22, E25, E27, E28, E31

5 Conclusion

In this paper we developed a system for the detection of anomalous parts in a text. The system was tested on 40 English and 40 Arabic texts and it is capable of uncovering an anomaly when the text is a composition of two texts. The results for English texts are better than for Arabic texts. It is still necessary to perform more statistical tests in the evaluation of the developed system and to find better parameter settings, mainly for the Arabic texts. This can be done through experiments with the parameters or by understanding the differences in the structure of the two languages.

Acknowledgment

The research is supported by the Slovak Scientific Grant Agency VEGA, Grant No. 1/0142/15. We thank Bc. Peter Sedmák from Pavol Jozef Šafárik University in Košice for his help with the programming in Java.

References

[Almarimi2015] A. Almarimi and G. Andrejková. Document verification using n-grams and histograms of words. IEEE 13th International Scientific Conference on Informatics (2322-5157), pp. 21-26, 2015.
[Almarimi2016] A. Almarimi, G. Andrejková and P. Sedmák. Self Organizing Maps in Text Anomalies Detections. Proceedings of the conference Cognition and Artificial Life, Telč, 2016.

[Almarimi2016] A. Almarimi. Dissimilarities Detections in Arabic and English Texts Using n-grams, Histograms and Self Organizing Maps. PhD Thesis, Pavol Jozef Šafárik University in Košice, pp. 1-111, 2016.

[Barreto2009] G. A. Barreto and L. Aguayo. Time series clustering for anomaly detection: Using competitive neural networks. Proceedings of WSOM 2009, LNCS 5629, pp. 28-36, 2009.

[Bensalem2013] I. Bensalem, P. Rosso and S. Chikhi. A new corpus for the evaluation of Arabic intrinsic plagiarism detection. CLEF 2013, LNCS 8138, pp. 53-58, 2013.

[Chomsky2005] N. Chomsky. Three factors in language design. Linguistic Inquiry 36(1), pp. 1-22, 2005.

[Durgin2005] N. A. Durgin and P. Zhang. Profile-based adaptive anomaly detection for network security. SAND2005-7293, pp. 1-44, 2005.

[Eissen2006] S. M. zu Eissen, B. Stein and M. Kulig. Plagiarism detection without reference collections. In: Decker, R., Lenz, H. J. (eds.): GfKl, Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Berlin, pp. 359-366, 2006.

[Hassan2012] F. I. H. Hassan and M. A. Chaurasia. N-gram based text author verification. IPCSIT, IACSIT Press, 36:67-71, Singapore, 2012.

[Jurafsky2000] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice-Hall, 1st ed., 2000.

[CorA2011] CorpusArabic. King Saud University corpus of classical Arabic. http://ksucorpus.ksu.edu.sa, 2011.

[CorE2011] CorpusEnglish. PAN plagiarism corpus 2011. http://www.uniweimar.de/en/media/chairs/webis/corpora/pan-pc-11/, 2011.

[Hammer2005] B. Hammer, A. Micheli, N. Neubauer, A. Sperduti and M. Strickert. Self-organizing maps for time series. WSOM 2005, Paris, pp. 1-8, 2005.

[Kohonen2007] T. Kohonen. Self Organizing Maps. Prentice-Hall, 2nd ed., 2007.

[Neme2015] A. Neme, J. R. Pulido, A. Muñoz, S. Hernández and T. Dey. Stylistics analysis and authorship attribution algorithms based on self-organizing maps. Neurocomputing, pp. 147-159, 2015.

[Stamatatos2006] E. Stamatatos. Ensemble-based author identification using character n-grams. In: Proceedings of the 3rd International Workshop on Text-based Information Retrieval, 36, pp. 41-46, 2006.

[Stamatatos2010] E. Stamatatos. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol., pp. 538-556, 2010.