=Paper=
{{Paper
|id=Vol-2962/paper04
|storemode=property
|title=Anomaly Detection in Text Documents using HTM Networks
|pdfUrl=https://ceur-ws.org/Vol-2962/paper04.pdf
|volume=Vol-2962
|authors=Zoltán Szoplák,Gabriela Andrejková
|dblpUrl=https://dblp.org/rec/conf/itat/SzoplakA21
}}
==Anomaly Detection in Text Documents using HTM Networks ==
Zoltán Szoplák, Gabriela Andrejková
Institute of Computer Science, Faculty of Science
P. J. Šafárik University in Košice
Jesenná 5, 04001 Košice, Slovakia
zoltan.szoplak@student.upjs.sk
gabriela.andrejkova@upjs.sk
Abstract: Anomalies in texts can be caused mainly by various interventions in texts, such as supplementing parts of a text from different authors. This type of anomaly can disrupt text that would otherwise be consistent. In order to find anomalies we have combined multiple algorithms, including a non-traditional neural network model, the Hierarchical Temporal Memory (HTM) network. HTM networks are spatiotemporal predictors based on the neocortex that combine the ability to retain memories of time sequences, like recurrent neural networks, with the spatial representations of convolutional neural networks.

To represent the text inputs for the HTM algorithm we use semantic folding, which encodes text differently than other embedding methods: as a collection of the contexts it occurs in. Alongside such a predictor we use numerous other, better-known metrics, and combine them into a two-step algorithm. In the first step we find the division points between the anomalous and non-anomalous parts. In the second step we determine which sections located between two division points are actually anomalous. The algorithm was tested on 40 benchmark texts from the PAN plagiarism corpus PAN-PC-11; the accuracy of determining whether a text contains anomalies is 100 %. The percentage of fully detected anomalies is 70.15 %.

Keywords: HTM network; Semantic folding; Text anomaly detection

1 Introduction

Anomalies in texts are certain types of outliers which can be detected in parts of texts that differ from the rest of the text. Anomaly detection is therefore the task of identifying such parts of a text that deviate from the rest of the text to a suspicious degree. In this paper, we are concerned with creating a method capable of detecting anomalous or plagiarized sentences in English language texts.

We have formulated two problems:

T1 Determining whether the text itself is anomalous (the text contains an anomalous part)

T2 Determining the number and location of anomalies in the text

Although solving the second problem provides a solution to the first, the first problem is solvable in a shorter time: it is sufficient to detect the first anomaly. For recommender systems, it is often enough to point out that a text does contain anomalous parts and leave the rest to manual analysis. Such an approach regarding the dataset in [1] has already been attempted in [2], where the task was to merely determine whether a given text contained any anomalies. The second task is self-explanatory: we aim to detect precisely the chunks of text that are anomalous. It is important to note that while the algorithm presented does contain a measurement of the degree of anomalousness, we consider sections either fully anomalous or fully anomaly free, as the dataset suggests. The first problem classifies a text as anomalous if there is even a single anomaly present, while in the second we label individual sections as anomalous or not.

We propose to experiment with streaming anomaly detection that takes into account context and narrative progression by analyzing texts sentence by sentence. To implement such analysis, multiple encoding methods were considered. Embedding methods such as Doc2Vec, described in [3], encode the exact word-for-word composition of the sentence. While useful for many tasks, we wanted a method that would be able to extract and compare the context and topic of a sentence, rather than its exact wording. We therefore chose to use the Semantic Folding Theory, described in [4], and combined it with the HTM algorithm [5], a neural network designed specifically to work with the kinds of representations that the Semantic Folding Theory creates. Since there can be types of anomalies that are not semantic ones, or cases where a semantic change is not indicative of an anomaly, we have decided to implement more metrics that are designed to capture syntactic and statistical information as well, supplementing the predictions made by the HTM network as well as comparing their usefulness to the aforementioned method.

The article is organized as follows: in the second section we discuss the current state of the art. In the third section we describe the "Semantic Folding (SF)" method. In Section 4, we provide a brief description of HTM networks. Section 5 is devoted to the description of our new algorithm for finding anomalies. The results are presented in the sixth section, while the seventh section provides a conclusion.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related works

Anomalies can be considered a kind of outlier. General methods for finding outliers and anomalies, such as those in Aggarwal [6] and Chandola [7], are also useful for finding anomalies in texts, but texts are a special type of data for which specialized methods can be used.

Zhuang et al. [8] developed a generative model to identify frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure which is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets showed that the method achieves a very strong improvement in outlier ranking.

In Kannan et al. [9] a matrix factorization method is presented, which is naturally able to distinguish the anomalies using low-rank approximations of the underlying texts.

Young et al. [16] review significant deep learning related models and methods that have been used for numerous NLP tasks. They also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.

Recently, several articles have been published on the search for anomalies using HTM networks, described in Hawkins et al. [10, 11]. Important publications on the use of HTM networks in finding anomalies include the paper of Ahmad and Purdy [12]. They presented a novel HTM-based online sequence memory anomaly detection technique for time-series data. They demonstrated impressive results from a live application that detects anomalies in financial metrics in real time.

In another article, Ahmad et al. [13] proposed a novel anomaly detection algorithm that works on streaming data. The technique is based on an online sequence memory algorithm based on HTM. They presented results using the Numenta Anomaly Benchmark (NAB), a benchmark containing real-world data streams with labeled anomalies.

Cui et al. [14] presented a comparative study of HTM networks, a neurally-inspired model, and other feedforward and recurrent artificial neural network models on both artificial and real-world sequence prediction tasks. They reported that HTM and long short-term memory (LSTM) networks gave the best prediction accuracy. HTM has many other beneficial properties and features that are desirable for real-world sequence learning.

Hole [15] concentrated on understanding how the HTM learning algorithms can detect anomalies in complex adaptive information and communications technology (ICT) systems. HTM finds anomalies in real-time streaming data. There is no need to store huge amounts of data since HTM builds models representing the properties of the data. They examined anomalies in Amazon Web Services (AWS) streaming data and then studied how HTM detects rogue human behavior.

The aforementioned research has inspired us to make use of these networks. To satisfy the input requirements of HTM networks, we needed to find a way to encode text data as Sparse Distributed Representations (SDRs). In natural language processing (NLP), an important method that allows text data to be stored as sparse distributed representations is "Semantic Folding (SF)" [4].

3 Semantic Folding

Semantic folding theory creates Sparse Distributed Representations (SDRs) of text data, called semantic fingerprints, to emulate the structure of the neocortex, the area of the brain that is responsible for several high-level cognitive functions such as vision, hearing, touch, movement or, most relevant to our case, language. A single fingerprint ideally represents the contexts and clusters of contexts that are present in a given text. Such a representation is achieved by first gathering a corpus representative of the text that we aim to encode, slicing it into snippets (sequences of words, usually paragraphs) and arranging them into a 2D array using self-organizing maps. As a consequence, similar snippets (those that share a lot of words) end up close together, forming clusters.

After creating the array of our representation, we can obtain the semantic fingerprint of a word by checking for every single context of the array whether it contains the input word or not. We set the given index to 1 in case the word is present in the snippets of the context and to 0 otherwise. The result is a sparse vector, since most words only occur in a handful of contexts; it is therefore preferable to store only the indices that have a value of 1 to save memory.

Since looking through every single context for each input word is very time-consuming, it is preferable to simply create a vocabulary of words from our corpus, calculate the encodings for each word and store their representations in a database. If we want to encode collections or sequences of words, we simply take the fingerprint of each individual word, add up the active bits for every index and then activate only the indices where the number of active bits across all the fingerprints exceeds a certain threshold.

Merging the fingerprints in such a way allows us to retain sparsity and to prune from the representation all contexts that may be relevant to the individual words but are irrelevant in the context they occur in. Such a representation has a few notable perks. First, every individual bit in our representation has its own specific meaning, unlike word encodings such as Bag-of-words or ASCII codes. Furthermore, comparing representations can be as easy as calculating the number of overlapping active bits. We can also take advantage of the fact that similar contexts are clustered, and compare representations using metrics reliant on geometry, such as Euclidean and cosine distance. While such a representation does not take word order into account and thus is unsuitable for language generation, it is very much suitable for topic matching and preventing semantic drift.
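To make the encoding concrete, the following is a minimal Python sketch of fingerprint creation and merging, assuming the 2D context array has already been built and flattened into a list of word sets; the function names are illustrative, not part of any semantic folding library.

    from typing import List, Set

    def word_fingerprint(word: str, contexts: List[Set[str]]) -> Set[int]:
        # One bit per context: the bit is active iff the context's snippets contain the word.
        return {i for i, snippet_words in enumerate(contexts) if word in snippet_words}

    def merge_fingerprints(fingerprints: List[Set[int]], threshold: int) -> Set[int]:
        # Keep only the bits active in at least `threshold` word fingerprints,
        # which preserves sparsity and prunes contexts irrelevant to the phrase.
        counts: dict = {}
        for fp in fingerprints:
            for bit in fp:
                counts[bit] = counts.get(bit, 0) + 1
        return {bit for bit, c in counts.items() if c >= threshold}

    def overlap(fp_a: Set[int], fp_b: Set[int]) -> int:
        # Semantic similarity as the number of shared active bits.
        return len(fp_a & fp_b)

Comparing two sentences then reduces to merging their word fingerprints and counting the overlap, which is what makes this representation cheap to use in the metrics described later.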
4 Hierarchical Temporal Memory
The human neocortex learns by recognizing patterns in sequences of information representing sensory "inputs" and predicts the likely following values based on previous observations. Hierarchical Temporal Memory (HTM) is a type of neural network that tries to reproduce the structure and processes of the neocortex. A more detailed description of HTM is given in [5].
4.1 HTM Network Structure
An HTM network has an inherently bottom-up hierarchical structure, as seen in Figure 1, comprised of multiple layers of 3-dimensional arrays of bits, referred to as cells. Interconnected layers form a hierarchy where each layer is connected to the one below it, with the one at the bottom connecting to the input of the network itself. A layer itself is a 3-dimensional array of bits comprised of columns arranged in a 2D array topology, where a column is made up of one or multiple cells. Each column is connected to a subset of the input/previous layer, and cells in a single column are also connected to the cells above and below. The hierarchy of layers is inspired by biology, with the neocortex consisting of multiple regions that either receive their input directly from sensory organs or from other regions connected to them.

Figure 1: Structure of an HTM network [5].

The learning algorithm of HTM networks can be divided into two steps [17]:

• Spatial Pooler (SP) creates a Sparse Distributed Representation (SDR) based on the input and can be viewed as a mapping function from the input domain to a new feature domain, where the meaning of the input is preserved while ensuring that the representation in the feature domain remains sparse. The algorithm is a type of unsupervised competitive learning algorithm that uses a form of vector quantization resembling self-organizing maps (SOMs).

Figure 2: SP - creating the SDR of a layer based on the input [5].

• Temporal Memory (TM) forms connections from the cells active in the current step to cells that were active just prior, and makes predictions. The algorithm uses Hebb's rule, where connections are formed between cells that were previously active. Through the formation of those connections a sequence may be learned. The TM can then use its learned knowledge of the sequences to form predictions.

Figure 3: TM - representation of a layer after determining predictive cells (predictive cells darker, non-active cells white, active cells grey) [5].

4.2 Using HTM in our Algorithm

The explicit mathematical description of the computations of the HTM network can be found in Mnatzaganian et al. [17]. It describes all aspects of the spatial pooler, a critical learning component in HTM, under a single unifying framework. The primary learning mechanism is explored, and a maximum likelihood estimator for determining the degree of the permanence update is proposed.

The HTM algorithm, in essence, allows us to take an array of bits as an input and to predict the input of subsequent steps. Such predictions may be enough for some tasks; however, in order to detect anomalies, we need to extract the occurring differences in the patterns presented in the inputs.
Let’s define vector xt as the input generated in our sys- stores the last k raw anomaly scores. In addition, we de-
tem at time t. Then the sequence of inputs to the algorithm fine the W 0 window with a size of j which is much smaller
can be defined as x1 , . . . , xt−1 , xt , xt−1 , . . . , xk , . . . possibly than k that is used to calculate a small moving average
continuing until the detection task is stopped manually. of the last few anomaly scores. It makes a more accurate
In general, the goal of streaming anomaly detection is comparison metric than a single score. We calculate the
to find abnormalities in inputs as soon as they occur. Such anomaly likelihood using the Q-function (the tail distri-
a real time constraint means that for the purposes of an bution function to the Gaussian normal distribution func-
anomaly detection at time t the inputs of earlier times tion), where the mean and variance of raw anomaly scores
(1, ...,t − 1) are accessible only, not the input at time t + 1 for the normal distribution function are recalculated every
or later. The HTM algorithm is capable of detecting such time using the values of anomaly scores in our window-
anomalies as well by using a method described in [12]. sized memory.
Let’s define two variables to explain the HTM anomaly The anomaly likelihood metric at time t is defined as the
detection algorithm: Let a(xt ) be the sparse binary repre- complement of the tail probability:
sentation of the input vector xt , defined by the binary 2D
matrix of all cells within a region, where ai, j (xt ) is set to µ̄t − µt
Lt = 1 − Q (2)
1 if the ith cell of the jth column is in the active state and σt
to 0 otherwise. Let π(xt ) be the sparse binary representa-
where:
tion of the prediction of the next input - a(xt+1 ), defined
by the binary 2D matrix of all cells within a region, where 1
Z ∞ 2
u
πi, j (xt ) is set to 1 if the ith cell of the jth column is in the Q(x) = √ − exp − du (3)
2π x 2
predictive state and to 0 otherwise. The values of the pre-
diction matrix are greatly influenced not just by the input 0
∑i=W −1 st−i ∑i=W −1 st−i
itself, but the context as well. µt = i=0 , µ̄t = i=0 (4)
k j
Therefore the accuracy of the algorithm prediction is
largely dependent on its ability to model the data. These ∑i=W −1 2
(st−i − µt )
two variables are calculated at each step of the algorithm, σt2 = i=0
(5)
k−1
however, they do not contain sufficient information to find
anomalies by themselves. Instead, we use these variables Naturally, the greater the likelihood score the more
to compute a raw anomaly score for each timestamp, la- probable it is that we have found the offset of an anomaly.
belled st . The raw anomaly score essentially gives us a
5 Anomaly Detection in Texts

The newly developed system combines the previously described methods and works in two steps:

1. Finding the locations of the text changes. The first step is to create an algorithm that can find the exact locations where the style of the text changes, whether from non-plagiarized to plagiarized or vice versa. We consider sentences the smallest unit of text to be analyzed this way, as we do not expect sentences that have both anomalous and non-anomalous parts within them. Therefore, finding the locations of text changes in practical terms means finding the offsets of the first sentence of the anomalous part and the first sentence of the non-anomalous part.

2. Filtering out the non-anomalous potential sequences. The second step is to take the offsets found in the first step and filter out the sections located between two offsets that are not actually anomalous.

5.1 Finding the Locations of Text Changes

We have created the following features for the purpose of determining where such points of change lie:

F1. The anomaly likelihood score of each sentence.
F2. The cosine similarity of the Doc2Vec vectors and their autoencoder predictions.

F3. The Euclidean distance between the current fingerprint and the fingerprint of preceding sentences.

F4. The cosine similarity between the current fingerprint and the fingerprint of preceding sentences.

F5. The Jaccard index between the current fingerprint and the fingerprint of preceding sentences.

F6. The average relational frequency of the words contained within a sentence.

F7. The lowest relational frequency of the words contained within a sentence.

F8. The highest relational frequency of the words contained within a sentence.

F9. The difference of the mean relational word frequency of the sentence and the text.

To calculate F1, we use the semantic fingerprint method to create a fingerprint of each sentence, use these fingerprints as inputs to the HTM network described in Section 4, and then calculate the anomaly likelihood score.

To calculate F2, we use Doc2Vec, an embedding method described in [3], to create a vector representation of each sentence of our text. We then use these vectors as inputs to train an autoencoder to first create an encoding of lower dimensionality and then reconstruct from the code an output that best matches the input. After training the autoencoder, we calculate the cosine similarity between the input embedding and the reconstructed output embedding. Since most of the text is expected to come from a single source with occasional anomalous parts inserted, the network, through training, creates generalized reconstructions that have more success reconstructing the inputs from the original text than from the anomalous parts.

To calculate F3, F4 and F5, we use semantic fingerprints, comparing the fingerprint of each sentence to the merged fingerprint of the preceding 5 sentences in a window (determined to be the average anomaly length). We use 3 different metrics of comparison: Euclidean distance, cosine similarity and Jaccard distance. The metrics are suited to detecting the first sentence of an anomalous section as well as the first sentence of a non-anomalous section that follows an anomalous section.

To calculate F6, F7 and F8, we use the relational frequency metrics described in [18], which measure how specific a given word is to a given segment. We calculate the frequency of the most and least frequent word as well as the mean frequency of all words in a sentence. Anomalous sentences are presumed to have low relational frequency scores due to having words atypical for the rest of the text. A low mean relational frequency means that a sentence contains many unique words. A low highest relational frequency means that unique stopwords are used. A low lowest relational frequency means that a sentence-unique word is present (which might not be by itself indicative of an anomaly).

To calculate F9, we use the mean relative frequency of the author style metric described in [19]. We calculate the frequency of each word in the sentence as well as in the document. We average these values across all of the words and compare the mean value of the sentence frequencies and the document frequencies to get the difference in author styles.

We use a Gradient Boosting Classifier (GBC), described in [20], to combine the predictive capability of the aforementioned features as well as to evaluate their relative importance. The GBC is an ensemble of multiple models, specifically decision trees, which has superior prediction capability compared to the individual models. It achieves this by iteratively adding models to the combined model that minimize the residual loss obtained by the combined model in the previous step.

To label the dataset, we chose a rather simplistic method: we label all sentences as zeroes except the first sentence of each anomaly and the first sentence after the anomaly ends. Our prediction of these division points is then used to make the prediction about where the potential anomalous sections might be.

5.2 Filtering out Non-anomalous Potential Sequences

Now that we have our division points, we still need to identify which sections lying between two points are anomalous and which are not. If we merely tagged them in an alternating fashion, it would lead to a large number of errors, as a single faulty division point could mean that we misclassify our entire dataset. We need some way of determining the anomalous nature of individual sections. We can assume that most anomalous sections are relatively short compared to non-anomalous sections. We can pair up indices of division points that are located close to one another, specifically within 50 sentences of each other. While we can create possible pairings of anomalous parts that are located close to one another, there are individual division points that cannot be paired. They might be false positives, or they might correspond to a beginning or ending that has not been found yet. For each isolated index, we construct multiple artificial sections at varying distances before and after it. The exact process is described in Algorithm 1.
Algorithm 1: Algorithm for creating potential anomalous sections from division points

    Data: P - sorted list of division points, where the value of each point is the index of the sentence within the document
    Result: S - list of potential anomalous sections
    i ← 1
    while i ≤ length(P) do
        if P[i] − P[i−1] ≤ 50 then
            S.addSection(P[i−1], P[i])
        else if P[i+1] − P[i] ≤ 50 then
            S.addSection(P[i], P[i+1])
        else
            for j in range(1, 10) do
                S.addSection(P[i] − 5 × j, P[i])
                S.addSection(P[i], P[i] + 5 × j)
            end
        end
        i ← i + 1
    end

Such pairings inevitably lead to false positives, therefore it is vital to devise a method that can recognize genuine anomalous parts. We use a gradient boosting regressor to predict the anomalousness of each potential section (the percentage of anomalous sentences in the section).

There are multiple parameters that we use as inputs, some from the previous step and others from modified algorithms that deal with sections instead of sentences. Input values for the regressor are the following:
• The mean value of F2 and F6-F9 over every sentence of the section (the metrics using Semantic Folding are not used, due to them having high values at the division points).

• The Euclidean distance between the fingerprint of the section and the fingerprint of the entire document.

• The cosine similarity between the fingerprint of the section and the fingerprint of the entire document.

• The Jaccard index between the fingerprint of the section and the fingerprint of the entire document.

• The average relational frequency of the words contained within the entire section.

• The lowest relational frequency of the words contained within the entire section.

• The highest relational frequency of the words contained within the entire section.

• The difference of the mean relational word frequency of the section and the text.

After training our regressor, we threshold the anomalousness values to obtain the list of anomalous sections. Due to the artificial creation of sections, overlaps may be possible. We describe a way to merge these sections in Algorithm 2.

Algorithm 2: Algorithm for merging overlapping sections

    Data: S - list of potential anomalous sections, each defined by the index of its starting sentence and the index of its last sentence
    Result: S - list of potential anomalous sections without overlaps
    OverlappingSections ← TRUE
    while OverlappingSections do
        OverlappingSections ← FALSE
        for i in range(0, length(S) − 1) do
            for j in range(i + 1, length(S)) do
                if overlaps(S[i], S[j]) then
                    OverlappingSections ← TRUE
                    first ← min(S[i].first, S[j].first)
                    last ← max(S[i].last, S[j].last)
                    S.addSection(first, last)
                    S.removeSection(S[i])
                    S.removeSection(S[j])
                end
            end
        end
    end
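A compact Python rendering of Algorithms 1 and 2, under our reading of the pseudocode, is the following sketch; the constants (gap of 50 sentences, stride 5, up to 10 steps) come from Algorithm 1, and the quadratic merge loop of Algorithm 2 is replaced by an equivalent sort-and-sweep.

    def build_sections(points, max_gap=50, steps=10, stride=5):
        # Algorithm 1: pair nearby division points; around isolated points,
        # build artificial sections at growing distances before and after.
        sections = []
        for i in range(1, len(points)):
            if points[i] - points[i - 1] <= max_gap:
                sections.append((points[i - 1], points[i]))
            elif i + 1 < len(points) and points[i + 1] - points[i] <= max_gap:
                sections.append((points[i], points[i + 1]))
            else:
                for j in range(1, steps):
                    sections.append((points[i] - stride * j, points[i]))
                    sections.append((points[i], points[i] + stride * j))
        return sections

    def merge_overlaps(sections):
        # Algorithm 2: merge any overlapping sections into one.
        merged = []
        for first, last in sorted(sections):
            if merged and first <= merged[-1][1]:       # overlaps the previous section
                merged[-1] = (merged[-1][0], max(merged[-1][1], last))
            else:
                merged.append((first, last))
        return merged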
6 Results

6.1 Data preparation

We worked with the PAN intrinsic anomaly detection plagiarism corpus 2011 [1] as experimental text data. It is a collection of 4753 larger text bodies in the English language, made up of various topics, where some texts have plagiarised parts artificially inserted into them. We have chosen 40 texts from the corpus to test our system.

For each text, we have two files: a .txt file containing the text itself and an .xml file containing various metadata, such as the source of the base text and the author and, most importantly, the list of anomalous parts defined by their division points and lengths. The number of plagiarized parts in a text is not constant and can range from 0, with no plagiarism inserted, up to 10 anomalous parts. Anomalous parts are whole sentences or entire paragraphs; thus we only consider entire sentences or larger collections of sentences as possible anomalies.

First we remove all stop words from the text, such as a, the, is, at, which, on etc., since these words have little relevance to the meaning or authorial style of the text.

We solved two tasks as follows:

Task 1: To find out whether the suspicious text is anomalous. Here, we are only trying to determine whether a text contains any anomalies or not, ignoring the location or the number of anomalies present.

Task 2: To identify the individual anomalous parts within the text correctly, taking into consideration their precise locations.

We first split the text into sentences using the NLTK (Natural Language Toolkit) tokenizer, described in [21], marking certain words ending with a punctuation mark as exceptions when splitting (such as st., mr., mrs., dr. etc.). We also removed all non-text characters, all stop words as well as words shorter than 3 letters. Finally, we also merged sentences that consist of only a single word with the first non-single-word sentence that comes after them. Such merging allows us to avoid a lot of false positives which would occur by having sentences that have very little meaning on their own. We also store which sentences they were merged from and apply the calculated features to all of the sentences.
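A minimal sketch of this preprocessing with NLTK could look as follows; the abbreviation set and the word regex are illustrative, and the merging of single-word sentences is omitted for brevity.

    import re
    from nltk.corpus import stopwords                # requires nltk.download("stopwords")
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    # Abbreviations that must not end a sentence (st., mr., mrs., dr., ...).
    punkt_params = PunktParameters()
    punkt_params.abbrev_types = {"st", "mr", "mrs", "dr"}
    tokenizer = PunktSentenceTokenizer(punkt_params)

    stop_words = set(stopwords.words("english"))

    def preprocess(text: str) -> list:
        # Split into sentences, then drop non-text characters, stop words
        # and words shorter than 3 letters.
        cleaned = []
        for sent in tokenizer.tokenize(text.lower()):
            words = re.findall(r"[a-z]+", sent)
            cleaned.append([w for w in words if len(w) >= 3 and w not in stop_words])
        return cleaned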
We then calculate the features for each sentence. For the merged fingerprints of the preceding sentences used for F1, F3, F4 and F5, we have chosen to merge the fingerprints of the five sentences that came before the current sentence. In the case of the first sentence, we do not have a merged fingerprint, so we use the current fingerprint as the input to the Temporal Pooler as is. In cases where fewer than 5 sentences are available, we use a merged fingerprint consisting of the sentences that are available. Then, in the case of F1, as was described, we create for each sentence a fingerprint where only those bits have a value of 1 that are active in both the merged print and the current print. The fingerprint arrays have a size of 128 × 128 and use the standard English associative dictionary of cortical.io. As the fingerprints are encoded as a collection of active bits, we have a list of sorted indices that are between 1 and 16384.

To increase computational efficiency, we split the list into four parts, each holding a portion of the values, in the following way: the first list contains all of the values between 1 and 4096, the second between 4097 and 8192, the third between 8193 and 12288 and the fourth between 12289 and 16384. We then create sparse vectors from these lists by having a 1 at each index that is present in the list and 0 everywhere else. Such a division does not pose much of a problem for the prediction capabilities of the HTM algorithm, as it has a feature that allows it to separate individual patterns that represent a sequence of inputs that still belongs to a single object.
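The four-way split can be sketched as follows, assuming a fingerprint is given as a sorted list of 1-based active indices; the helper is illustrative only.

    import numpy as np

    def split_fingerprint(active_indices, total_bits=16384, parts=4):
        # Turn a list of active-bit indices (1..16384) into four dense
        # binary vectors of 4096 bits each, one per quarter of the bit space.
        size = total_bits // parts
        vectors = np.zeros((parts, size), dtype=np.uint8)
        for idx in active_indices:
            part, offset = divmod(idx - 1, size)   # indices are 1-based
            vectors[part, offset] = 1
        return vectors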
6.2 HTM Network and Gradient Boosting Classifier Training

We first fit the HTM network using the entire text once as the training set and then run the same inputs through the network to get our predictions. We calculated the anomaly scores and from them the likelihood scores for each sentence of the document. For the moving window of mean anomaly scores, we used a window size of 5 that stores the previous 5 anomaly scores. For F2, we used Doc2Vec to create 100-dimensional vectors from each sentence. We have chosen a five-layer feed-forward perceptron as our autoencoder network, with the layer sizes being (experimentally chosen): 100, 50, 20, 50, 100, with the first and last layers being the same size as the Doc2Vec inputs. We trained the network on the text for 5 epochs (experimentally chosen), then performed the reconstruction for each sentence and calculated the cosine similarity between the input and the reconstruction.
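The F2 autoencoder could be approximated, for example, with scikit-learn's MLPRegressor trained to reconstruct its own input; since scikit-learn adds the 100-unit output layer automatically, hidden sizes (50, 20, 50) yield the 100-50-20-50-100 shape. This is a sketch under those assumptions, not the authors' exact implementation.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics.pairwise import cosine_similarity

    def f2_scores(X: np.ndarray) -> np.ndarray:
        # X: one 100-dimensional Doc2Vec vector per sentence (n_sentences x 100).
        ae = MLPRegressor(hidden_layer_sizes=(50, 20, 50), activation="relu",
                          solver="adam", max_iter=5)      # ~5 training epochs
        ae.fit(X, X)                                      # reconstruct the inputs themselves
        X_hat = ae.predict(X)
        # Similarity of each sentence vector to its own reconstruction.
        return np.array([cosine_similarity(X[i:i + 1], X_hat[i:i + 1])[0, 0]
                         for i in range(len(X))])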
Features F3-F5 are calculated just as described in Section 5, using the five sentences preceding the current sentence to form our merged fingerprint and comparing it to the fingerprint of the current sentence. Features F6-F9 are also calculated as described in Section 5, with the relational frequencies and author style calculated separately for every sentence on a version of the text that did not have short words or stop words removed.

We used these features as the input parameters to our Gradient Boosting Classifier with 200 estimators and a maximum depth of 4. After training our classifier, we can take the positive examples of sentence offsets and construct the potential sections from them. To filter them, we train a Gradient Boosting regressor with 200 estimators and a depth of 4 to predict their relevance. We then threshold the prediction to get the potential anomalous parts. We experimented with multiple thresholds and found the threshold value of 0.7 to be sufficient. Finally, we merge the potential sections and then compare them with the actual anomalous sections using multiple metrics, such as precision, recall and accuracy, to measure our success. For evaluation, we consider the anomalies as positive examples and label every single character of the full unmodified text, therefore taking into account the length of sentences as well.
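The two gradient boosting models described above could be set up as in the following sketch; the feature matrices and labels are assumed to have been built as described in Sections 5 and 6.1-6.2.

    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    def train_two_step(X_sent, y_division, X_sec, y_fraction):
        # Step 1: classify sentences as division points (label 1 only for the
        # first sentence of an anomaly and the first sentence after it ends).
        clf = GradientBoostingClassifier(n_estimators=200, max_depth=4)
        clf.fit(X_sent, y_division)
        # Step 2: regress the anomalousness of candidate sections
        # (the fraction of anomalous sentences in the section).
        reg = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        reg.fit(X_sec, y_fraction)
        return clf, reg

    def flag_sections(reg, X_candidates, sections, threshold=0.7):
        # Keep candidate sections whose predicted anomalousness passes the threshold.
        return [sec for sec, score in zip(sections, reg.predict(X_candidates))
                if score >= threshold]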
6.3 Experimental results

We have evaluated the predictions of anomalies made by our model and organized the results into Table 1. For comparison, we have implemented the Author style algorithm proposed by Kuznetsov et al. [19] and evaluated it on our dataset as well. Its results have also been organized into Table 1, alongside the results of our own algorithm.

The columns of the table are arranged in the following manner: the Txt column corresponds to the id number of the text. The Plag det column tells us how many anomalies our algorithm detected fully out of the number of plagiarisms present. The T1 (Task 1) column tells us the conclusion our algorithm reached when determining whether the text is anomalous or not (N for non-anomalous, Y for anomalous). Columns Pre, Rec and Acc refer respectively to the precision, recall and accuracy values achieved by our algorithm when classifying anomalies. Columns Pre A2, Rec A2 and Acc A2 are the results achieved by [19]. If there are no predicted anomalous parts/sentences, the precision metric has the value N/A, since calculating the metric becomes meaningless. Likewise, if there are no actual anomalous parts/sentences present in the text, the recall metric similarly has a value of N/A.

Table 1: Information about 40 English texts from corpus [1].

Txt | Plag det | T1 | Pre  | Rec  | Acc  | Pre A2 | Rec A2 | Acc A2
 1  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 2  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 3  | 6/8      | Y  | 0.86 | 0.64 | 0.98 | 0.73   | 0.66   | 0.97
 4  | 3/4      | Y  | 0.97 | 0.44 | 0.98 | 0.70   | 0.15   | 0.96
 5  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 0.98
 6  | 4/5      | Y  | 0.96 | 0.70 | 0.98 | 1.0    | 0.24   | 0.96
 7  | 11/16    | Y  | 0.96 | 0.57 | 0.94 | 0.97   | 0.32   | 0.90
 8  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 9  | 7/13     | Y  | 0.97 | 0.74 | 0.94 | 0.85   | 0.67   | 0.91
10  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 0.99
11  | 5/11     | Y  | 0.97 | 0.50 | 0.90 | 0.91   | 0.57   | 0.91
12  | 0        | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
13  | 6/10     | Y  | 0.97 | 0.73 | 0.93 | 1.0    | 0.07   | 0.77
14  | 6/6      | Y  | 0.85 | 1.0  | 0.99 | 1.0    | 0.02   | 0.96
15  | 0        | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
16  | 8/13     | Y  | 0.87 | 0.60 | 0.90 | 0.75   | 0.05   | 0.78
17  | 2/2      | Y  | 0.81 | 1.0  | 0.99 | 1.0    | 0.08   | 0.96
18  | 4/4      | Y  | 0.92 | 1.0  | 0.98 | 1.0    | 0.87   | 0.96
19  | 6/9      | Y  | 0.97 | 0.54 | 0.93 | 0.98   | 0.11   | 0.87
20  | 13/16    | Y  | 0.91 | 0.81 | 0.97 | 0.89   | 0.73   | 0.95
21  | 8/12     | Y  | 0.96 | 0.65 | 0.97 | 0.90   | 0.69   | 0.96
22  | 6/6      | Y  | 0.83 | 1.0  | 0.99 | 0.88   | 0.27   | 0.96
23  | 6/7      | Y  | 0.93 | 0.94 | 0.99 | N/A    | 0.0    | 0.95
24  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
25  | 4/5      | Y  | 0.87 | 0.98 | 0.99 | 1.0    | 0.21   | 0.97
26  | 5/12     | Y  | 0.98 | 0.52 | 0.88 | 0.94   | 0.53   | 0.88
27  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
28  | 4/5      | Y  | 0.69 | 1.0  | 0.98 | N/A    | 0.0    | 0.95
29  | 6/8      | Y  | 0.88 | 0.72 | 0.99 | 1.0    | 0.07   | 0.96
30  | 3/4      | Y  | 0.94 | 0.94 | 0.99 | N/A    | 0.0    | 0.95
31  | 7/7      | Y  | 0.94 | 1.0  | 0.99 | 1.0    | 0.18   | 0.90
32  | 1/3      | Y  | 0.99 | 0.15 | 0.96 | 0.82   | 0.09   | 0.95
33  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | 0.0    | 1.0
34  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | 0.0    | 1.0
35  | 2/6      | Y  | 0.94 | 0.40 | 0.86 | 0.99   | 0.20   | 0.82
36  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
37  | 8/9      | Y  | 0.88 | 0.97 | 0.98 | 1.0    | 0.07   | 0.87
38  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
39  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
40  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0

As we can see in Table 1, our algorithm achieved an accuracy of 100 % in T1. Regarding T2, the overall percentage of fully found plagiarisms is 70.15 %. The results show high accuracy due to the far greater number of non-anomalous sections in the actual text, as well as high precision due to the number of false positives being further reduced by filtration. The downside of such extensive filtering can be seen in the recall values, however, where the results vary. Such a discrepancy between the precision and recall values can be observed with both algorithms. When it comes to accuracy, the results of the author style algorithm were better only in the case of text 11, equal in multiple cases and inferior to ours in most.

When looking at the metrics of precision and recall, the author style results tend to achieve a better, sometimes perfect, precision score, but this usually comes at the cost of significantly worse recall and accuracy scores. While precision is important, we believe that for recommender systems it is better to create a few false positives in order to flag most of the anomalous sections, rather than being more certain about the anomalousness of fewer sections.

As for the performance of the author style algorithm on T1, it failed to achieve 100 % accuracy, due to often predicting sentences to be anomalous when there are no anomalies to be found (such as texts 5 or 10 in Table 1) or not finding any anomalies despite the text containing them (such as texts 23 or 28). It appears that our results mostly surpass the results generated by the author style algorithm, therefore showing that sequential anomaly detection is worth consideration.

However, the disparity between precision and recall is still not ideal, and lowering our threshold did not give us much improvement in recall in comparison to the decrease in precision. We believe this occurs because of short anomalous sections: our artificially constructed anomalous sections are at least 5 sentences long, which might be more than the minimum length of said anomalies. We can see that a lot depends on the division point detection part of the algorithm, therefore we have decided to plot the importance of features for our classifier, shown in Figure 4.

Figure 4: Feature importance of the gradient boosting classifier.

As we can see, the most successful feature was F9, the mean relational word frequency comparison between the whole text and the sentence. The anomaly likelihood is very close, being designed specifically to detect points of change. We can also see that using Semantic Folding without a strong predictor like HTM decreases the performance, as evidenced by the relative uselessness of the fingerprint comparison metrics (Euclidean, cosine and Jaccard).

The metrics that determine the most unique sentences instead of division points are also quite important in the prediction, as demonstrated by the usefulness of the cosine similarity of the Doc2Vec vectors as well as the mean relational frequency of words. In the case of the lowest frequency metric, there may easily be sentence-unique words in sentences that are from non-anomalous parts, and sentences from anomalous parts that do not have sentence-unique words. As for the highest frequency, while some anomalous sentences might not have the same stop words as the rest of the sentences, others might, making it not all that reliable.
7 Conclusion

We have developed a method capable of detecting intrinsic anomalies in natural text based on sequential analysis of the syntactic and semantic patterns exhibited by potentially anomalous text. The algorithm consists of a two-step process of identifying the locations of the starting and ending sentences of anomalies and determining whether a section located between two such points is anomalous. For both of these steps we use gradient boosting to form a prediction out of various metrics extracted from the text using the Semantic Folding algorithm, the HTM network, Doc2Vec, autoencoders and metrics based on word frequencies.

The algorithm was tested and evaluated on 40 English texts with artificially inserted plagiarisms from the PAN intrinsic anomaly detection plagiarism corpus 2011. Our objective was twofold: first, to determine whether the text contains any anomalies at all, and second, to determine the exact number and position of the anomalous passages. The algorithm achieved an accuracy of 100 % in Task 1 and better-than-expected results in Task 2, depending on the text, with relatively high values of precision and accuracy but varying recall. The overall percentage of fully found plagiarisms within the texts was 70.15 %. We found that our algorithm was able to improve on the results of Kuznetsov et al. [19]. While there is much room for improvement, we believe the paradigm of continuous analysis in plagiarism detection to be an interesting and valid one that provides non-dismissible results and might be a step in the right direction when solving similar problems.

Acknowledgements. The research is supported by the Slovak Scientific Grant Agency VEGA, Grant No. 1/0177/21 "Descriptional and Computational Complexity of Automata and Algorithms".

References

[1] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A. and Rosso, P.: PAN Plagiarism Corpus 2011 (PAN-PC-11). DOI 10.5281/zenodo.3250095, (2011) http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-11/

[2] Almarimi, A., Andrejková, G., Salem, A.: Anomaly Searching in Text Sequences. CEUR, ISSN 1613-0073, Vol-2046, urn:nbn:de:0074-2046-8, Proceedings of the 11th Joint Conference on Mathematics and Computer Science, Eger, Hungary, May 20-22, 2016.

[3] Le, Q. and Mikolov, T.: Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1188-1196, (2014)

[4] De Sousa Webber, F.: Semantic Folding Theory and its Application in Semantic Fingerprinting. White paper, Version 1.2, 1–59 (2016)

[5] Hawkins, J., Ahmad, S., Purdy, S. and Lavin, A.: Biological and Machine Intelligence (BAMI). http://numenta.com/business-strategy-and-ip/. Release 0.4, (2017)

[6] Aggarwal, C. C.: Outlier Analysis. Springer Science+Business Media New York, (2013). https://doi.org/10.1007/978-3-319-47578-3

[7] Chandola, V., Banerjee, A. and Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv., 41(3), (2009)

[8] Zhuang, H., Wang, Ch., Tao, F., Kaplan, L. and Han, J.: Identifying Semantically Deviating Outlier Documents. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2748–2757, Copenhagen, Denmark, September 7–11, 2017. Association for Computational Linguistics

[9] Kannan, R., Woo, H., Aggarwal, C. C. and Park, H.: Outlier Detection for Text Data: An Extended Version. arXiv:1701.01325v1, (2017)

[10] Hawkins, J. and Ahmad, S.: Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Frontiers in Neural Circuits 10(23), 1–13 (2016)

[11] Hawkins, J., Lewis, M., Klukas, M., Purdy, S., Ahmad, S.: A framework for intelligence and cortical function based on grid cells in the neocortex. Frontiers in Neural Circuits, Vol. 12, Article 121, (2019)

[12] Ahmad, S. and Purdy, S.: Real-Time Anomaly Detection for Streaming Analytics. arXiv:1607.02480v1, (2016)

[13] Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 267 (2017), 134-147

[14] Cui, Y., Surpur, Ch., Ahmad, S., Hawkins, J.: A comparative study of HTM and other neural network models for online sequence learning with streaming data. 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 1530-1538, doi: 10.1109/IJCNN.2016.7727380

[15] Hole, K. J.: Anomaly Detection with HTM. Chapter 12 in: Anti-fragile ICT Systems, Springer, 2016

[16] Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent Trends in Deep Learning Based Natural Language Processing. 2018, arXiv:1708.02709v8 [cs.CL]

[17] Mnatzaganian, J., Fokoué, E. and Kudithipudi, D.: A Mathematical Formalization of Hierarchical Temporal Memory's Spatial Pooler. arXiv:1601.06116v3, (2016)

[18] Oberreuter, G., L'Huillier, G., Ríos, S. A., Velásquez, J. D.: Approaches for Intrinsic and External Plagiarism Detection. Notebook for PAN at CLEF 2011, (2011)

[19] Kuznetsov, M., Motrenko, A., Kuznetsova, R. and Strijov, V.: Methods for intrinsic plagiarism detection and author diarization. Notebook for PAN at CLEF 2016, (2016)

[20] Natekin, A. and Knoll, A.: Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, Volume 7, Article 21, December 2013, DOI 10.3389/fnbot.2013.00021, (2013)

[21] Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. CoRR, cs.CL/0205028, (2002)