=Paper=
{{Paper
|id=Vol-2962/paper04
|storemode=property
|title=Anomaly Detection in Text Documents using HTM Networks
|pdfUrl=https://ceur-ws.org/Vol-2962/paper04.pdf
|volume=Vol-2962
|authors=Zoltán Szoplák,Gabriela Andrejková
|dblpUrl=https://dblp.org/rec/conf/itat/SzoplakA21
}}
==Anomaly Detection in Text Documents using HTM Networks ==
Zoltán Szoplák, Gabriela Andrejková
Institute of Computer Science, Faculty of Science
P. J. Šafárik University in Košice
Jesenná 5, 04001 Košice, Slovakia
zoltan.szoplak@student.upjs.sk
gabriela.andrejkova@upjs.sk
Abstract: Anomalies in texts can be caused mainly by various interventions in texts, such as supplementing parts of a text from different authors. This type of anomaly can disrupt text that would otherwise be consistent. In order to find anomalies we have combined multiple algorithms, including a non-traditional neural network model, the Hierarchical Temporal Memory (HTM) network. HTM networks are spatiotemporal predictors based on the neocortex that combine the ability to retain memories of time sequences, like recurrent neural networks, with the spatial representations of convolutional neural networks.

To represent the text inputs for the HTM algorithm we use semantic folding, which encodes text differently than other embedding methods: as a collection of the contexts it occurs in. Alongside such a predictor we use numerous other, better-known metrics, and combine them into a two-step algorithm. In the first step we find the division points between the anomalous and non-anomalous parts. In the second step we determine which sections located between two division points are actually anomalous. The algorithm was tested on 40 benchmark texts from the PAN plagiarism corpus PAN-PC-11; the accuracy of determining whether a text contains anomalies is 100 %. The percentage of fully detected anomalies is 70.15 %.

Keywords: HTM network; Semantic folding; Text anomaly detection

1 Introduction

Anomalies in texts are certain types of outliers which can be detected in parts of texts that differ from the rest of the text. Anomaly detection is therefore the task of identifying such parts of a text that deviate from the rest of the text to a suspicious degree. In this paper, we are concerned with creating a method capable of detecting anomalous or plagiarized sentences in English language texts.

We have formulated two problems:

T1 Determining whether the text itself is anomalous (the text contains an anomalous part)

T2 Determining the number and location of anomalies in the text

Although solving the second problem provides a solution to the first, the first problem is solvable in a shorter time: it is sufficient to detect the first anomaly. For recommender systems, it is often enough to point out that a text does contain anomalous parts and leave the rest to manual analysis. Such an approach regarding the dataset in [1] has already been attempted in [2], where the task was to merely determine whether a given text contained any anomalies. The second task is self-explanatory: we aim to detect precisely the chunks of text that are anomalous. It is important to note that while the algorithm presented does contain a measurement of the degree of anomalousness, we consider sections either fully anomalous or fully anomaly free, as the dataset suggests. The first problem classifies a text as anomalous if there is even a single anomaly present, while in the second we label individual sections as anomalous or not.

We propose to experiment with streaming anomaly detection that takes into account context and narrative progression by analyzing texts sentence by sentence. To implement such analysis, multiple encoding methods were considered. Embedding methods such as Doc2Vec, described in [3], encode the exact word-for-word composition of the sentence. While useful for many tasks, we wanted a method that would be able to extract and compare the context and topic of a sentence, rather than its exact wording. We therefore chose to use the Semantic Folding Theory, described in [4], and combined it with the HTM algorithm [5], a neural network designed specifically to work with the kinds of representations that the Semantic Folding Theory creates. Since there can be types of anomalies that are not semantic ones, or cases where a semantic change is not indicative of an anomaly, we have decided to implement more metrics that are designed to capture syntactic and statistical information as well, supplementing the predictions made by the HTM network as well as comparing their usefulness to the aforementioned method.

The article is organized as follows: in the second section we discuss the current state of the art. In the third section we describe the "Semantic Folding (SF)" method. In Section 4, we provide a brief description of HTM networks. Section 5 is devoted to the description of our new algorithm for finding anomalies. The results are presented in the sixth section, while the seventh section provides a conclusion.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related works

Anomalies can be considered a kind of outlier. General methods for finding outliers and anomalies, such as those in Aggarwal [6] and Chandola [7], are also useful for finding anomalies in texts, but texts are a special type of data for which specialized methods can be used.

Zhuang et al. [8] developed a generative model to identify frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure which is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets showed that the method achieves a very strong improvement in outlier ranking.

In Kannan et al. [9] a matrix factorization method is presented, which is naturally able to distinguish the anomalies using low-rank approximations of the underlying texts.

Young et al. [16] review significant deep learning related models and methods that have been used for numerous NLP tasks. They also summarize, compare and contrast the various models and put forward a detailed understanding of the past, present and future of deep learning in NLP.

Recently, several articles have been published on the search for anomalies using HTM networks, described in Hawkins et al. [10, 11]. Important publications on the use of HTM networks in finding anomalies include the paper of Ahmad and Purdy [12]. They presented a novel HTM-based online sequence memory anomaly detection technique for time-series data. They demonstrated impressive results from a live application that detects anomalies in financial metrics in real time.

In another article, Ahmad et al. [13] proposed a novel anomaly detection algorithm that works on streaming data. The technique is based on an online sequence memory algorithm based on HTM. They presented results using the Numenta Anomaly Benchmark (NAB), a benchmark containing real-world data streams with labeled anomalies.

Cui et al. [14] presented a comparative study of HTM networks, a neurally-inspired model, and other feedforward and recurrent artificial neural network models on both artificial and real-world sequence prediction tasks. They reported that HTM and long short-term memory (LSTM) networks gave the best prediction accuracy. HTM has many other beneficial properties and features that are desirable for real-world sequence learning.

Hole [15] concentrated on understanding how the HTM learning algorithms can detect anomalies in complex adaptive information and communications technology (ICT) systems. HTM finds anomalies in real-time streaming data. There is no need to store huge amounts of data since HTM builds models representing the properties of the data. They examined anomalies in Amazon Web Services (AWS) streaming data and then studied how HTM detects rogue human behavior.

The aforementioned research has inspired us to make use of these networks. To satisfy the input requirements of HTM networks, we needed to find a way to encode text data as Sparse Distributed Representations (SDRs). In natural language processing (NLP), an important method that allows text data to be stored as sparse distributed representations is "Semantic Folding (SF)" [4].

3 Semantic Folding

Semantic folding theory creates Sparse Distributed Representations (SDRs) of text data, called semantic fingerprints, to emulate the structure of the neocortex, the area of the brain that is responsible for several high-level cognitive functions such as vision, hearing, touch, movement or, most relevant to our case, language. A single fingerprint ideally represents the contexts and clusters of contexts that are present in a given text. Such a representation is achieved by first gathering a corpus representative of the text that we aim to encode, slicing it into snippets (sequences of words, usually paragraphs) and arranging them into a 2D array using self-organizing maps. As a consequence, similar snippets (those that share a lot of words) end up close together, forming clusters.

After creating the array of our representation, we can obtain the semantic fingerprint of a word by checking for every single context of the array whether it contains the input word or not. We set the given index to 1 in case the word is present in the snippets of the context and to 0 otherwise. The result is a sparse vector, since most words only occur in a handful of contexts; it is therefore preferable to store only the indices that have a value of 1 to save memory.

Since looking through every single context for each input word is very time-consuming, it is preferable to simply create a vocabulary of words from our corpus, calculate the encodings for each word and store their representations in a database. If we want to encode collections or sequences of words, we simply take the fingerprint of each individual word, add up the active bits for every index and then activate only the indices where the number of active bits across all the fingerprints exceeds a certain threshold.

Merging the fingerprints in such a way allows us to retain sparsity and to prune from the representation all contexts that may be relevant to the individual words but are irrelevant in the context they occur in. Such a representation has a few notable perks. First, every individual bit in our representation has its own specific meaning, unlike word encodings such as Bag-of-words or ASCII codes. Furthermore, comparing representations can be as easy as calculating the number of overlapping active bits. We can also take advantage of the fact that similar contexts are clustered, and compare representations using metrics reliant on geometry, such as Euclidean and cosine distance. While such a representation does not take word order into account and thus is unsuitable for language generation, it is very much suitable for topic matching and preventing semantic drift.
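To make the encoding concrete, the following is a minimal Python sketch of fingerprint creation and merging, assuming the 2D context array has already been built and flattened into a list of word sets; the function names are illustrative, not part of any semantic folding library.

    from typing import List, Set

    def word_fingerprint(word: str, contexts: List[Set[str]]) -> Set[int]:
        # One bit per context: the bit is active iff the context's snippets contain the word.
        return {i for i, snippet_words in enumerate(contexts) if word in snippet_words}

    def merge_fingerprints(fingerprints: List[Set[int]], threshold: int) -> Set[int]:
        # Keep only the bits active in at least `threshold` word fingerprints,
        # which preserves sparsity and prunes contexts irrelevant to the phrase.
        counts: dict = {}
        for fp in fingerprints:
            for bit in fp:
                counts[bit] = counts.get(bit, 0) + 1
        return {bit for bit, c in counts.items() if c >= threshold}

    def overlap(fp_a: Set[int], fp_b: Set[int]) -> int:
        # Semantic similarity as the number of shared active bits.
        return len(fp_a & fp_b)

Comparing two sentences then reduces to merging their word fingerprints and counting the overlap, which is what makes this representation cheap to use in the metrics described later.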
4 Hierarchical Temporal Memory
The human neocortex learns by recognizing patterns in sequences of information representing sensory "inputs" and predicts the likely following values based on previous observations. Hierarchical Temporal Memory (HTM) is a type of neural network that tries to reproduce the structure and processes of the neocortex. A more detailed description of HTM is given in [5].
4.1 HTM Network Structure
An HTM network has an inherently bottom-up hierarchical structure, as seen in Figure 1, comprised of multiple layers of 3-dimensional arrays of bits, referred to as cells. Interconnected layers form a hierarchy where each layer is connected to the one below it, with the one at the bottom connecting to the input of the network itself. A layer itself is a 3-dimensional array of bits comprised of columns arranged in a 2D array topology, where a column is made up of one or multiple cells. Each column is connected to a subset of the input/previous layer, and cells in a single column are also connected to the cells above and below. The hierarchy of layers is inspired by biology, with the neocortex consisting of multiple regions that either receive their input directly from sensory organs or from other regions connected to them.

Figure 1: Structure of an HTM network [5].

The learning algorithm of HTM networks can be divided into two steps [17]:

• Spatial Pooler (SP) creates a Sparse Distributed Representation (SDR) based on the input and can be viewed as a mapping function from the input domain to a new feature domain, where the meaning of the input is preserved while ensuring that the representation in the feature domain remains sparse. The algorithm is a type of unsupervised competitive learning algorithm that uses a form of vector quantization resembling self-organizing maps (SOMs).

Figure 2: SP - creating the SDR of a layer based on the input [5].

• Temporal Memory (TM) forms connections from the cells active in the current step to cells that were active just prior, and makes predictions. The algorithm uses Hebb's rule, where connections are formed between cells that were previously active. Through the formation of those connections a sequence may be learned. The TM can then use its learned knowledge of the sequences to form predictions.

Figure 3: TM - representation of a layer after determining predictive cells (predictive cells darker, non-active cells white, active cells grey) [5].

4.2 Using HTM in our Algorithm

The explicit mathematical description of the computations of the HTM network can be found in Mnatzaganian et al. [17]. It describes all aspects of the spatial pooler, a critical learning component in HTM, under a single unifying framework. The primary learning mechanism is explored, and a maximum likelihood estimator for determining the degree of the permanence update is proposed.

The HTM algorithm, in essence, allows us to take an array of bits as an input and to predict the input of subsequent steps. Such predictions may be enough for some tasks; however, in order to detect anomalies, we need to extract the occurring differences in the patterns presented in the inputs.
Let’s define vector xt as the input generated in our sys- stores the last k raw anomaly scores. In addition, we de-
tem at time t. Then the sequence of inputs to the algorithm fine the W 0 window with a size of j which is much smaller
can be defined as x1 , . . . , xt−1 , xt , xt−1 , . . . , xk , . . . possibly than k that is used to calculate a small moving average
continuing until the detection task is stopped manually. of the last few anomaly scores. It makes a more accurate
In general, the goal of streaming anomaly detection is comparison metric than a single score. We calculate the
to find abnormalities in inputs as soon as they occur. Such anomaly likelihood using the Q-function (the tail distri-
a real time constraint means that for the purposes of an bution function to the Gaussian normal distribution func-
anomaly detection at time t the inputs of earlier times tion), where the mean and variance of raw anomaly scores
(1, ...,t − 1) are accessible only, not the input at time t + 1 for the normal distribution function are recalculated every
or later. The HTM algorithm is capable of detecting such time using the values of anomaly scores in our window-
anomalies as well by using a method described in [12]. sized memory.
Let’s define two variables to explain the HTM anomaly The anomaly likelihood metric at time t is defined as the
detection algorithm: Let a(xt ) be the sparse binary repre- complement of the tail probability:
sentation of the input vector xt , defined by the binary 2D
matrix of all cells within a region, where ai, j (xt ) is set to µ̄t − µt
Lt = 1 − Q (2)
1 if the ith cell of the jth column is in the active state and σt
to 0 otherwise. Let π(xt ) be the sparse binary representa-
where:
tion of the prediction of the next input - a(xt+1 ), defined
by the binary 2D matrix of all cells within a region, where 1
Z ∞ 2
u
πi, j (xt ) is set to 1 if the ith cell of the jth column is in the Q(x) = √ − exp − du (3)
2π x 2
predictive state and to 0 otherwise. The values of the pre-
diction matrix are greatly influenced not just by the input 0
∑i=W −1 st−i ∑i=W −1 st−i
itself, but the context as well. µt = i=0 , µ̄t = i=0 (4)
k j
Therefore the accuracy of the algorithm prediction is
largely dependent on its ability to model the data. These ∑i=W −1 2
(st−i − µt )
two variables are calculated at each step of the algorithm, σt2 = i=0
(5)
k−1
however, they do not contain sufficient information to find
anomalies by themselves. Instead, we use these variables Naturally, the greater the likelihood score the more
to compute a raw anomaly score for each timestamp, la- probable it is that we have found the offset of an anomaly.
belled st . The raw anomaly score essentially gives us a
5 Anomaly Detection in Texts

The newly developed system combines the previously described methods and works in two steps:

1. Finding the locations of the text changes. The first step is to create an algorithm that can find the exact locations where the style of the text changes, whether from non-plagiarized to plagiarized or vice versa. We consider sentences the smallest unit of text to be analyzed this way, as we do not expect sentences that have both anomalous and non-anomalous parts within them. Therefore, finding the locations of text changes in practical terms means finding the offsets of the first sentence of the anomalous part and the first sentence of the non-anomalous part.

2. Filtering out the non-anomalous potential sequences. The second step is to take the offsets found in the first step and filter out the sections located between two offsets that are not actually anomalous.

5.1 Finding the Locations of Text Changes

We have created the following features for the purpose of determining where such points of change lie:

F1. The anomaly likelihood score of each sentence.
F2. The cosine similarity of the Doc2Vec vectors and their autoencoder predictions.

F3. The Euclidean distance between the current fingerprint and the fingerprint of preceding sentences.

F4. The cosine similarity between the current fingerprint and the fingerprint of preceding sentences.

F5. The Jaccard index between the current fingerprint and the fingerprint of preceding sentences.

F6. The average relational frequency of the words contained within a sentence.

F7. The lowest relational frequency of the words contained within a sentence.

F8. The highest relational frequency of the words contained within a sentence.

F9. The difference of the mean relational word frequency of the sentence and the text.

To calculate F1, we use the semantic fingerprint method to create a fingerprint of each sentence, use these fingerprints as inputs to the HTM network described in Section 4, and then calculate the anomaly likelihood score.

To calculate F2, we use Doc2Vec, an embedding method described in [3], to create a vector representation of each sentence of our text. We then use these vectors as inputs to train an autoencoder to first create an encoding of lower dimensionality and then reconstruct from the code an output that best matches the input. After training the autoencoder, we calculate the cosine similarity between the input embedding and the reconstructed output embedding. Since most of the text is expected to come from a single source with occasional anomalous parts inserted, the network, through training, creates generalized reconstructions that have more success reconstructing the inputs from the original text than from the anomalous parts.

To calculate F3, F4 and F5, we use semantic fingerprints, comparing the fingerprint of each sentence to the merged fingerprint of the preceding 5 sentences in a window (determined to be the average anomaly length). We use 3 different metrics of comparison: Euclidean distance, cosine similarity and Jaccard distance. The metrics are suited to detecting the first sentence of an anomalous section as well as the first sentence of a non-anomalous section that follows an anomalous section.

To calculate F6, F7 and F8, we use the relational frequency metrics described in [18], which measure how specific a given word is to a given segment. We calculate the frequency of the most and least frequent word as well as the mean frequency of all words in a sentence. Anomalous sentences are presumed to have low relational frequency scores due to having words atypical for the rest of the text. A low mean relational frequency means that a sentence contains many unique words. A low highest relational frequency means that unique stopwords are used. A low lowest relational frequency means that a sentence-unique word is present (which might not be by itself indicative of an anomaly).

To calculate F9, we use the mean relative frequency of the author style metric described in [19]. We calculate the frequency of each word in the sentence as well as in the document. We average these values across all of the words and compare the mean value of the sentence frequencies and the document frequencies to get the difference in author styles.

We use a Gradient Boosting Classifier (GBC), described in [20], to combine the predictive capability of the aforementioned features as well as to evaluate their relative importance. The GBC is an ensemble of multiple models, specifically decision trees, which has superior prediction capability compared to the individual models. It achieves this by iteratively adding models to the combined model that minimize the residual loss obtained by the combined model in the previous step.

To label the dataset, we chose a rather simplistic method: we label all sentences as zeroes except the first sentence of each anomaly and the first sentence after the anomaly ends. Our prediction of these division points is then used to make the prediction about where the potential anomalous sections might be.

5.2 Filtering out Non-anomalous Potential Sequences

Now that we have our division points, we still need to identify which sections lying between two points are anomalous and which are not. If we merely tagged them in an alternating fashion, it would lead to a large number of errors, as a single faulty division point could mean that we misclassify our entire dataset. We need some way of determining the anomalous nature of individual sections. We can assume that most anomalous sections are relatively short compared to non-anomalous sections. We can pair up indices of division points that are located close to one another, specifically within 50 sentences of each other. While we can create possible pairings of anomalous parts that are located close to one another, there are individual division points that cannot be paired. They might be false positives, or they might correspond to a beginning or ending that has not been found yet. For each isolated index, we construct multiple artificial sections at varying distances before and after it. The exact process is described in Algorithm 1.
Algorithm 1: Algorithm for creating potential anomalous sections from division points

    Data: P - sorted list of division points, where the value of each point is the index of the sentence within the document
    Result: S - list of potential anomalous sections
    i ← 1
    while i ≤ length(P) do
        if P[i] − P[i−1] ≤ 50 then
            S.addSection(P[i−1], P[i])
        else if P[i+1] − P[i] ≤ 50 then
            S.addSection(P[i], P[i+1])
        else
            for j in range(1, 10) do
                S.addSection(P[i] − 5 × j, P[i])
                S.addSection(P[i], P[i] + 5 × j)
            end
        end
        i ← i + 1
    end

Such pairings inevitably lead to false positives, therefore it is vital to devise a method that can recognize genuine anomalous parts. We use a gradient boosting regressor to predict the anomalousness of each potential section (the percentage of anomalous sentences in the section).

There are multiple parameters that we use as inputs, some from the previous step and others from modified algorithms that deal with sections instead of sentences. Input values for the regressor are the following:
• The mean value of F2 and F6-F9 over every sentence of the section (the metrics using Semantic Folding are not used, due to them having high values at the division points).

• The Euclidean distance between the fingerprint of the section and the fingerprint of the entire document.

• The cosine similarity between the fingerprint of the section and the fingerprint of the entire document.

• The Jaccard index between the fingerprint of the section and the fingerprint of the entire document.

• The average relational frequency of the words contained within the entire section.

• The lowest relational frequency of the words contained within the entire section.

• The highest relational frequency of the words contained within the entire section.

• The difference of the mean relational word frequency of the section and the text.

After training our regressor, we threshold the anomalousness values to obtain the list of anomalous sections. Due to the artificial creation of sections, overlaps may be possible. We describe a way to merge these sections in Algorithm 2.

Algorithm 2: Algorithm for merging overlapping sections

    Data: S - list of potential anomalous sections, each defined by the index of its starting sentence and the index of its last sentence
    Result: S - list of potential anomalous sections without overlaps
    OverlappingSections ← TRUE
    while OverlappingSections do
        OverlappingSections ← FALSE
        for i in range(0, length(S) − 1) do
            for j in range(i + 1, length(S)) do
                if overlaps(S[i], S[j]) then
                    OverlappingSections ← TRUE
                    first ← min(S[i].first, S[j].first)
                    last ← max(S[i].last, S[j].last)
                    S.addSection(first, last)
                    S.removeSection(S[i])
                    S.removeSection(S[j])
                end
            end
        end
    end
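A compact Python rendering of Algorithms 1 and 2, under our reading of the pseudocode, is the following sketch; the constants (gap of 50 sentences, stride 5, up to 10 steps) come from Algorithm 1, and the quadratic merge loop of Algorithm 2 is replaced by an equivalent sort-and-sweep.

    def build_sections(points, max_gap=50, steps=10, stride=5):
        # Algorithm 1: pair nearby division points; around isolated points,
        # build artificial sections at growing distances before and after.
        sections = []
        for i in range(1, len(points)):
            if points[i] - points[i - 1] <= max_gap:
                sections.append((points[i - 1], points[i]))
            elif i + 1 < len(points) and points[i + 1] - points[i] <= max_gap:
                sections.append((points[i], points[i + 1]))
            else:
                for j in range(1, steps):
                    sections.append((points[i] - stride * j, points[i]))
                    sections.append((points[i], points[i] + stride * j))
        return sections

    def merge_overlaps(sections):
        # Algorithm 2: merge any overlapping sections into one.
        merged = []
        for first, last in sorted(sections):
            if merged and first <= merged[-1][1]:       # overlaps the previous section
                merged[-1] = (merged[-1][0], max(merged[-1][1], last))
            else:
                merged.append((first, last))
        return merged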
6 Results

6.1 Data preparation

We worked with the PAN intrinsic anomaly detection plagiarism corpus 2011 [1] as experimental text data. It is a collection of 4753 larger text bodies in the English language, made up of various topics, where some texts have plagiarised parts artificially inserted into them. We have chosen 40 texts from the corpus to test our system.

For each text, we have two files: a .txt file containing the text itself and an .xml file containing various metadata, such as the source of the base text and the author and, most importantly, the list of anomalous parts defined by their division points and lengths. The number of plagiarized parts in a text is not constant and can range from 0, with no plagiarism inserted, up to 10 anomalous parts. Anomalous parts are whole sentences or entire paragraphs; thus we only consider entire sentences or larger collections of sentences as possible anomalies.

First we remove all stop words from the text, such as a, the, is, at, which, on etc., since these words have little relevance to the meaning or authorial style of the text.

We solved two tasks as follows:

Task 1: To find out whether the suspicious text is anomalous. Here, we are only trying to determine whether a text contains any anomalies or not, ignoring the location or the number of anomalies present.

Task 2: To identify the individual anomalous parts within the text correctly, taking into consideration their precise locations.

We first split the text into sentences using the NLTK (Natural Language Toolkit) tokenizer, described in [21], marking certain words ending with a punctuation mark as exceptions when splitting (such as st., mr., mrs., dr. etc.). We also removed all non-text characters, all stop words as well as words shorter than 3 letters. Finally, we also merged sentences that consist of only a single word with the first non-single-word sentence that comes after them. Such merging allows us to avoid a lot of false positives which would occur by having sentences that have very little meaning on their own. We also store which sentences they were merged from and apply the calculated features to all of the sentences.
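A minimal sketch of this preprocessing with NLTK could look as follows; the abbreviation set and the word regex are illustrative, and the merging of single-word sentences is omitted for brevity.

    import re
    from nltk.corpus import stopwords                # requires nltk.download("stopwords")
    from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

    # Abbreviations that must not end a sentence (st., mr., mrs., dr., ...).
    punkt_params = PunktParameters()
    punkt_params.abbrev_types = {"st", "mr", "mrs", "dr"}
    tokenizer = PunktSentenceTokenizer(punkt_params)

    stop_words = set(stopwords.words("english"))

    def preprocess(text: str) -> list:
        # Split into sentences, then drop non-text characters, stop words
        # and words shorter than 3 letters.
        cleaned = []
        for sent in tokenizer.tokenize(text.lower()):
            words = re.findall(r"[a-z]+", sent)
            cleaned.append([w for w in words if len(w) >= 3 and w not in stop_words])
        return cleaned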
We then calculate the features for each sentence. For the merged fingerprints of the preceding sentences used for F1, F3, F4 and F5, we have chosen to merge the fingerprints of the five sentences that came before the current sentence. In the case of the first sentence, we do not have a merged fingerprint, so we use the current fingerprint as the input to the Temporal Pooler as is. In cases where fewer than 5 sentences are available, we use a merged fingerprint consisting of the sentences that are available. Then, in the case of F1, as was described, we create for each sentence a fingerprint where only those bits have a value of 1 that are active in both the merged print and the current print. The fingerprint arrays have a size of 128 × 128 and use the standard English associative dictionary of cortical.io. As the fingerprints are encoded as a collection of active bits, we have a list of sorted indices that are between 1 and 16384.

To increase computational efficiency, we split the list into four parts, each holding a portion of the values, in the following way: the first list contains all of the values between 1 and 4096, the second between 4097 and 8192, the third between 8193 and 12288 and the fourth between 12289 and 16384. We then create sparse vectors from these lists by having a 1 at each index that is present in the list and 0 everywhere else. Such a division does not pose much of a problem for the prediction capabilities of the HTM algorithm, as it has a feature that allows it to separate individual patterns that represent a sequence of inputs that still belongs to a single object.
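The four-way split can be sketched as follows, assuming a fingerprint is given as a sorted list of 1-based active indices; the helper is illustrative only.

    import numpy as np

    def split_fingerprint(active_indices, total_bits=16384, parts=4):
        # Turn a list of active-bit indices (1..16384) into four dense
        # binary vectors of 4096 bits each, one per quarter of the bit space.
        size = total_bits // parts
        vectors = np.zeros((parts, size), dtype=np.uint8)
        for idx in active_indices:
            part, offset = divmod(idx - 1, size)   # indices are 1-based
            vectors[part, offset] = 1
        return vectors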
6.2 HTM Network and Gradient Boosting Classifier Training

We first fit the HTM network using the entire text once as the training set and then run the same inputs through the network to get our predictions. We calculated the anomaly scores and from them the likelihood scores for each sentence of the document. For the moving window of mean anomaly scores, we used a window size of 5 that stores the previous 5 anomaly scores. For F2, we used Doc2Vec to create 100-dimensional vectors from each sentence. We have chosen a five-layer feed-forward perceptron as our autoencoder network, with the layer sizes being (experimentally chosen): 100, 50, 20, 50, 100, with the first and last layers being the same size as the Doc2Vec inputs. We trained the network on the text for 5 epochs (experimentally chosen), then performed the reconstruction for each sentence and calculated the cosine similarity between the input and the reconstruction.
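The F2 autoencoder could be approximated, for example, with scikit-learn's MLPRegressor trained to reconstruct its own input; since scikit-learn adds the 100-unit output layer automatically, hidden sizes (50, 20, 50) yield the 100-50-20-50-100 shape. This is a sketch under those assumptions, not the authors' exact implementation.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics.pairwise import cosine_similarity

    def f2_scores(X: np.ndarray) -> np.ndarray:
        # X: one 100-dimensional Doc2Vec vector per sentence (n_sentences x 100).
        ae = MLPRegressor(hidden_layer_sizes=(50, 20, 50), activation="relu",
                          solver="adam", max_iter=5)      # ~5 training epochs
        ae.fit(X, X)                                      # reconstruct the inputs themselves
        X_hat = ae.predict(X)
        # Similarity of each sentence vector to its own reconstruction.
        return np.array([cosine_similarity(X[i:i + 1], X_hat[i:i + 1])[0, 0]
                         for i in range(len(X))])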
Features F3-F5 are calculated just as described in Section 5, using the five sentences preceding the current sentence to form our merged fingerprint and comparing it to the fingerprint of the current sentence. Features F6-F9 are also calculated as described in Section 5, with the relational frequencies and author style calculated separately for every sentence on a version of the text that did not have short words or stop words removed.

We used these features as the input parameters to our Gradient Boosting Classifier with 200 estimators and a maximum depth of 4. After training our classifier, we can take the positive examples of sentence offsets and construct the potential sections from them. To filter them, we train a Gradient Boosting regressor with 200 estimators and a depth of 4 to predict their relevance. We then threshold the prediction to get the potential anomalous parts. We experimented with multiple thresholds and found the threshold value of 0.7 to be sufficient. Finally, we merge the potential sections and then compare them with the actual anomalous sections using multiple metrics, such as precision, recall and accuracy, to measure our success. For evaluation, we consider the anomalies as positive examples and label every single character of the full unmodified text, therefore taking into account the length of sentences as well.
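The two gradient boosting models described above could be set up as in the following sketch; the feature matrices and labels are assumed to have been built as described in Sections 5 and 6.1-6.2.

    from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

    def train_two_step(X_sent, y_division, X_sec, y_fraction):
        # Step 1: classify sentences as division points (label 1 only for the
        # first sentence of an anomaly and the first sentence after it ends).
        clf = GradientBoostingClassifier(n_estimators=200, max_depth=4)
        clf.fit(X_sent, y_division)
        # Step 2: regress the anomalousness of candidate sections
        # (the fraction of anomalous sentences in the section).
        reg = GradientBoostingRegressor(n_estimators=200, max_depth=4)
        reg.fit(X_sec, y_fraction)
        return clf, reg

    def flag_sections(reg, X_candidates, sections, threshold=0.7):
        # Keep candidate sections whose predicted anomalousness passes the threshold.
        return [sec for sec, score in zip(sections, reg.predict(X_candidates))
                if score >= threshold]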
6.3 Experimental results

We have evaluated the predictions of anomalies made by our model and organized the results into Table 1. For comparison, we have implemented the Author style algorithm proposed by Kuznetsov et al. [19] and evaluated it on our dataset as well. Its results have also been organized into Table 1, alongside the results of our own algorithm.

The columns of the table are arranged in the following manner: the Txt column corresponds to the id number of the text. The Plag det column tells us how many anomalies our algorithm detected fully out of the number of plagiarisms present. The T1 (Task 1) column tells us the conclusion our algorithm reached when determining whether the text is anomalous or not (N for non-anomalous, Y for anomalous). Columns Pre, Rec and Acc refer respectively to the precision, recall and accuracy values achieved by our algorithm when classifying anomalies. Columns Pre A2, Rec A2 and Acc A2 are the results achieved by [19]. If there are no predicted anomalous parts/sentences, the precision metric has the value N/A, since calculating the metric becomes meaningless. Likewise, if there are no actual anomalous parts/sentences present in the text, the recall metric similarly has a value of N/A.

Table 1: Information about 40 English texts from corpus [1].

Txt | Plag det | T1 | Pre  | Rec  | Acc  | Pre A2 | Rec A2 | Acc A2
 1  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 2  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 3  | 6/8      | Y  | 0.86 | 0.64 | 0.98 | 0.73   | 0.66   | 0.97
 4  | 3/4      | Y  | 0.97 | 0.44 | 0.98 | 0.70   | 0.15   | 0.96
 5  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 0.98
 6  | 4/5      | Y  | 0.96 | 0.70 | 0.98 | 1.0    | 0.24   | 0.96
 7  | 11/16    | Y  | 0.96 | 0.57 | 0.94 | 0.97   | 0.32   | 0.90
 8  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
 9  | 7/13     | Y  | 0.97 | 0.74 | 0.94 | 0.85   | 0.67   | 0.91
10  | 0        | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 0.99
11  | 5/11     | Y  | 0.97 | 0.50 | 0.90 | 0.91   | 0.57   | 0.91
12  | 0        | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
13  | 6/10     | Y  | 0.97 | 0.73 | 0.93 | 1.0    | 0.07   | 0.77
14  | 6/6      | Y  | 0.85 | 1.0  | 0.99 | 1.0    | 0.02   | 0.96
15  | 0        | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
16  | 8/13     | Y  | 0.87 | 0.60 | 0.90 | 0.75   | 0.05   | 0.78
17  | 2/2      | Y  | 0.81 | 1.0  | 0.99 | 1.0    | 0.08   | 0.96
18  | 4/4      | Y  | 0.92 | 1.0  | 0.98 | 1.0    | 0.87   | 0.96
19  | 6/9      | Y  | 0.97 | 0.54 | 0.93 | 0.98   | 0.11   | 0.87
20  | 13/16    | Y  | 0.91 | 0.81 | 0.97 | 0.89   | 0.73   | 0.95
21  | 8/12     | Y  | 0.96 | 0.65 | 0.97 | 0.90   | 0.69   | 0.96
22  | 6/6      | Y  | 0.83 | 1.0  | 0.99 | 0.88   | 0.27   | 0.96
23  | 6/7      | Y  | 0.93 | 0.94 | 0.99 | N/A    | 0.0    | 0.95
24  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
25  | 4/5      | Y  | 0.87 | 0.98 | 0.99 | 1.0    | 0.21   | 0.97
26  | 5/12     | Y  | 0.98 | 0.52 | 0.88 | 0.94   | 0.53   | 0.88
27  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
28  | 4/5      | Y  | 0.69 | 1.0  | 0.98 | N/A    | 0.0    | 0.95
29  | 6/8      | Y  | 0.88 | 0.72 | 0.99 | 1.0    | 0.07   | 0.96
30  | 3/4      | Y  | 0.94 | 0.94 | 0.99 | N/A    | 0.0    | 0.95
31  | 7/7      | Y  | 0.94 | 1.0  | 0.99 | 1.0    | 0.18   | 0.90
32  | 1/3      | Y  | 0.99 | 0.15 | 0.96 | 0.82   | 0.09   | 0.95
33  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | 0.0    | 1.0
34  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | 0.0    | 1.0
35  | 2/6      | Y  | 0.94 | 0.40 | 0.86 | 0.99   | 0.20   | 0.82
36  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
37  | 8/9      | Y  | 0.88 | 0.97 | 0.98 | 1.0    | 0.07   | 0.87
38  | 0/0      | N  | N/A  | N/A  | 1.0  | 0.0    | N/A    | 1.0
39  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0
40  | 0/0      | N  | N/A  | N/A  | 1.0  | N/A    | N/A    | 1.0

As we can see in Table 1, our algorithm achieved an accuracy of 100 % in T1. Regarding T2, the overall percentage of fully found plagiarisms is 70.15 %. The results show high accuracy due to the far greater number of non-anomalous sections in the actual text, as well as high precision due to the number of false positives being further reduced by filtration. The downside of such extensive filtering can be seen in the recall values, however, where the results vary. Such a discrepancy between the precision and recall values can be observed with both algorithms. When it comes to accuracy, the results of the author style algorithm were better only in the case of text 11, equal in multiple cases and inferior to ours in most.

When looking at the metrics of precision and recall, the author style results tend to achieve a better, sometimes perfect, precision score, but this usually comes at the cost of significantly worse recall and accuracy scores. While precision is important, we believe that for recommender systems it is better to create a few false positives in order to flag most of the anomalous sections, rather than being more certain about the anomalousness of fewer sections.

As for the performance of the author style algorithm on T1, it failed to achieve 100 % accuracy, due to often predicting sentences to be anomalous when there are no anomalies to be found (such as texts 5 or 10 in Table 1) or not finding any anomalies despite the text containing them (such as texts 23 or 28). It appears that our results mostly surpass the results generated by the author style algorithm, therefore showing that sequential anomaly detection is worth consideration.

However, the disparity between precision and recall is still not ideal, and lowering our threshold did not give us much improvement in recall in comparison to the decrease in precision. We believe this occurs because of short anomalous sections: our artificially constructed anomalous sections are at least 5 sentences long, which might be more than the minimum length of said anomalies. We can see that a lot depends on the division point detection part of the algorithm, therefore we have decided to plot the importance of features for our classifier, shown in Figure 4.

Figure 4: Feature importance of the gradient boosting classifier.

As we can see, the most successful feature was F9, the mean relational word frequency comparison between the whole text and the sentence. The anomaly likelihood is very close, being designed specifically to detect points of change. We can also see that using Semantic Folding without a strong predictor like HTM decreases the performance, as evidenced by the relative uselessness of the fingerprint comparison metrics (Euclidean, cosine and Jaccard).

The metrics that determine the most unique sentences instead of division points are also quite important in the prediction, as demonstrated by the usefulness of the cosine similarity of the Doc2Vec vectors as well as the mean relational frequency of words. In the case of the lowest frequency metric, there may easily be sentence-unique words in sentences that are from non-anomalous parts, and sentences from anomalous parts that do not have sentence-unique words. As for the highest frequency, while some anomalous sentences might not have the same stop words as the rest of the sentences, others might, making it not all that reliable.
7 Conclusion

We have developed a method capable of detecting intrinsic anomalies in natural text based on sequential analysis of the syntactic and semantic patterns exhibited by potentially anomalous text. The algorithm consists of a two-step process of identifying the locations of the starting and ending sentences of anomalies and determining whether a section located between two such points is anomalous. For both of these steps we use gradient boosting to form a prediction out of various metrics extracted from the text using the Semantic Folding algorithm, the HTM network, Doc2Vec, autoencoders and metrics based on word frequencies.

The algorithm was tested and evaluated on 40 English texts with artificially inserted plagiarisms from the PAN intrinsic anomaly detection plagiarism corpus 2011. Our objective was twofold: first, to determine whether the text contains any anomalies at all, and second, to determine the exact number and position of the anomalous passages. The algorithm achieved an accuracy of 100 % in Task 1 and better-than-expected results in Task 2, depending on the text, with relatively high values of precision and accuracy but varying recall. The overall percentage of fully found plagiarisms within the texts was 70.15 %. We found that our algorithm was able to improve on the results of Kuznetsov et al. [19]. While there is much room for improvement, we believe the paradigm of continuous analysis in plagiarism detection to be an interesting and valid one that provides non-dismissible results and might be a step in the right direction when solving similar problems.

Acknowledgements. The research is supported by the Slovak Scientific Grant Agency VEGA, Grant No. 1/0177/21 "Descriptional and Computational Complexity of Automata and Algorithms".

References

[1] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A. and Rosso, P.: PAN Plagiarism Corpus 2011 (PAN-PC-11). DOI 10.5281/zenodo.3250095, (2011) http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-11/

[2] Almarimi, A., Andrejková, G., Salem, A.: Anomaly Searching in Text Sequences. CEUR, ISSN 1613-0073, Vol-2046, urn:nbn:de:0074-2046-8, Proceedings of the 11th Joint Conference on Mathematics and Computer Science, Eger, Hungary, May 20-22, 2016.

[3] Le, Q. and Mikolov, T.: Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning, PMLR 32(2):1188-1196, (2014)

[4] De Sousa Webber, F.: Semantic Folding Theory and its Application in Semantic Fingerprinting. White paper, Version 1.2, 1–59 (2016)

[5] Hawkins, J., Ahmad, S., Purdy, S. and Lavin, A.: Biological and Machine Intelligence (BAMI). http://numenta.com/business-strategy-and-ip/. Release 0.4, (2017)

[6] Aggarwal, C. C.: Outlier Analysis. Springer Science+Business Media New York, (2013). https://doi.org/10.1007/978-3-319-47578-3

[7] Chandola, V., Banerjee, A. and Kumar, V.: Anomaly detection: A survey. ACM Comput. Surv., 41(3), (2009)

[8] Zhuang, H., Wang, Ch., Tao, F., Kaplan, L. and Han, J.: Identifying Semantically Deviating Outlier Documents. Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 2748–2757, Copenhagen, Denmark, September 7–11, 2017. Association for Computational Linguistics

[9] Kannan, R., Woo, H., Aggarwal, C. C. and Park, H.: Outlier Detection for Text Data: An Extended Version. arXiv:1701.01325v1, (2017)

[10] Hawkins, J. and Ahmad, S.: Why neurons have thousands of synapses, a theory of sequence memory in neocortex. Frontiers in Neural Circuits 10(23), 1–13 (2016)

[11] Hawkins, J., Lewis, M., Klukas, M., Purdy, S., Ahmad, S.: A framework for intelligence and cortical function based on grid cells in the neocortex. Frontiers in Neural Circuits, Vol. 12, Article 121, (2019)

[12] Ahmad, S. and Purdy, S.: Real-Time Anomaly Detection for Streaming Analytics. arXiv:1607.02480v1, (2016)

[13] Ahmad, S., Lavin, A., Purdy, S., Agha, Z.: Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 267 (2017), 134-147

[14] Cui, Y., Surpur, Ch., Ahmad, S., Hawkins, J.: A comparative study of HTM and other neural network models for online sequence learning with streaming data. 2016 International Joint Conference on Neural Networks (IJCNN), 2016, pp. 1530-1538, doi: 10.1109/IJCNN.2016.7727380

[15] Hole, K. J.: Anomaly Detection with HTM. Chapter 12 in: Anti-fragile ICT Systems, Springer, 2016

[16] Young, T., Hazarika, D., Poria, S., Cambria, E.: Recent Trends in Deep Learning Based Natural Language Processing. 2018, arXiv:1708.02709v8 [cs.CL]

[17] Mnatzaganian, J., Fokoué, E. and Kudithipudi, D.: A Mathematical Formalization of Hierarchical Temporal Memory's Spatial Pooler. arXiv:1601.06116v3, (2016)

[18] Oberreuter, G., L'Huillier, G., Ríos, S. A., Velásquez, J. D.: Approaches for Intrinsic and External Plagiarism Detection. Notebook for PAN at CLEF 2011, (2011)

[19] Kuznetsov, M., Motrenko, A., Kuznetsova, R. and Strijov, V.: Methods for intrinsic plagiarism detection and author diarization. Notebook for PAN at CLEF 2016, (2016)

[20] Natekin, A. and Knoll, A.: Gradient boosting machines, a tutorial. Frontiers in Neurorobotics, Volume 7, Article 21, December 2013, DOI 10.3389/fnbot.2013.00021, (2013)

[21] Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. CoRR, cs.CL/0205028, (2002)