<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Anomaly detection in text documents using HTM networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zoltán Szoplák</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriela Andrejková</string-name>
          <email>gabriela.andrejkova@upjs.sk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computer Science, Faculty of Science, P. J. Šafárik University in Košice, Jesenná 5</institution>
          ,
          <addr-line>04001 Košice</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Anomalies in texts can be caused mainly by various interventions in texts, such as supplementing parts of a text with passages from different authors. Such anomalies can disrupt text that would otherwise be consistent. In order to find anomalies, we have combined multiple algorithms, including a non-traditional neural network model - the Hierarchical Temporal Memory (HTM) network. HTM networks are spatiotemporal predictors based on the neocortex that combine the ability to retain memories of time sequences, like recurrent neural networks, with the spatial representations of convolutional neural networks. To represent the text inputs for the HTM algorithm we use semantic folding, which encodes text differently than other embedding methods: as a collection of the contexts it occurs in. Alongside such a predictor we use numerous other, better-known metrics, and combine them into a two-step algorithm. In the first step we find the division points between the anomalous and non-anomalous parts. In the second step we determine which sections located between two division points are actually anomalous. The algorithm was tested on 40 benchmark texts from the PAN plagiarism corpus PAN-PC-11; the accuracy in determining whether a text contains anomalies is 100 %. The percentage of fully detected anomalies is 70.15 %.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Anomalies in texts are certain types of outliers:
parts of a text that differ from the rest of the
text. Anomaly detection is therefore the task of
identifying such parts of a text that deviate from the rest
to a suspicious degree. In this paper, we are concerned
with creating a method capable of detecting anomalous or
plagiarized sentences in English-language texts.</p>
      <p>We have formulated two problems:
T1 Determining whether the text itself is anomalous (the
text contains an anomalous part).
T2 Determining the number and locations of anomalies in
the text.</p>
      <p>
        Although solving the second problem also provides a
solution to the first, the first problem is solvable in a shorter
time, since it is sufficient to detect the first anomaly. For
recommender systems, it is often enough to point out that a
text does contain anomalous parts and leave the rest of it to
manual analysis. Such an approach regarding the dataset
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has already been attempted in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where the task was
merely to determine whether a given text contained any
anomalies. The second task is self-explanatory: we aim
to detect precisely the chunks of text that are anomalous.
It is important to note that while the presented algorithm
does contain a measurement of the degree of
anomalousness, we consider sections either fully anomalous or fully
anomaly-free, as the dataset suggests. In the first problem, a
text is classified as anomalous if even a single
anomaly is present, while in the second task we
label individual sections as anomalous or not.
      </p>
      <p>
        We proposed to experiment with streaming anomaly
detection, taking into account context and narrative
progression by analyzing texts sentence by sentence. To
implement such analysis, multiple encoding methods were
considered. Embedding methods such as Doc2Vec, described
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] encode the exact word-for-word composition of
the sentence. While useful for many tasks, we wanted a
method that would be able to extract and compare the
context and topic of a sentence, rather than its exact
wording. We therefore chose to use the Semantic Folding
Theory, described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and combined it with the HTM
algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is a neural network designed specifically
to work with the kinds of representations that the
Semantic Folding Theory creates. Since there can be types of
anomalies that are not semantic, or cases where a
semantic change is not indicative of an anomaly, we have
decided to implement additional metrics designed to
capture syntactic and statistical information as well,
supplementing the predictions made by the HTM network and
comparing their usefulness to the aforementioned
method.
      </p>
      <p>The article is organized as follows: in the second
section we discuss the current state of the art. In the third
section we describe the "Semantic Folding (SF)" method.
In Section 4, we provide a brief description of HTM
networks. Section 5 is devoted to the description of our new
algorithm for finding anomalies. The results are presented
in the sixth section, while the seventh section provides a
conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>Related works</title>
      <p>
        Anomalies can be considered a kind of outlier. General
methods for finding outliers and anomalies such as those
in Aggarwal [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and Chandola [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are also useful for finding
anomalies in texts, but texts are a special type of data for
which special methods can be used.
      </p>
      <p>
        Zhuang et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] developed a generative model to
identify frequent and characteristic semantic regions in the
word embedding space to represent the given corpus, and a
robust outlierness measure which is resistant to noisy
content in documents. Experiments conducted on two
real-world textual data sets showed that the method can achieve
very strong improvements in outlier ranking.
      </p>
      <p>
        In Kannan et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
          ], a matrix factorization method is
presented, which is naturally able to distinguish anomalies
using low-rank approximations of the underlying texts.
      </p>
      <p>
        Young et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] review significant deep learning
related models and methods that have been used for
numerous NLP tasks. They also summarize, compare and
contrast the various models and put forward a detailed
understanding of the past, present and future of deep learning in
NLP.
      </p>
      <p>
        Recently, several articles have been published on the
search for anomalies using HTM networks, described in
Hawkins et al. [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. Important publications on the use
of HTM networks in finding anomalies include the paper
of Ahmad and Purdy [
        <xref ref-type="bibr" rid="ref12">12</xref>
          ]. They presented a novel HTM-based
online sequence memory anomaly detection
technique for time-series data. They demonstrated impressive
results from a live application that detects anomalies in
financial metrics in real time.
      </p>
      <p>
        In another article by Ahmad et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ], a novel anomaly detection algorithm that works on
streaming data is proposed. The technique is built on an
HTM-based online sequence memory algorithm. They presented
results using the Numenta Anomaly Benchmark (NAB), a
benchmark containing real-world data streams with
labeled anomalies.
      </p>
      <p>
        Cui et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
          ] presented a comparative study of HTM
networks, a neurally-inspired model, and other
feedforward and recurrent artificial neural network models on
both artificial and real-world sequence prediction
problems. They reported that HTM and long short-term
memory (LSTM) networks gave the best prediction
accuracy. HTM has many other beneficial properties and
features that are desirable for real-world sequence learning.
      </p>
      <p>
        Hole [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] concentrated on understanding how the HTM
learning algorithms can detect anomalies in complex
adaptive information and communications technology (ICT)
systems. HTM finds anomalies in real-time streaming
data. There is no need to store huge amounts of data
since HTM builds models representing the properties of
the data. He examined anomalies in Amazon Web
Services (AWS) streaming data and then studied how HTM
detects rogue human behavior.
      </p>
      <p>The aforementioned research has inspired us to make
use of these networks. To satisfy the input requirements
of HTM networks, we needed to find a way to encode text
data as Sparse Distributed Representations (SDRs).</p>
      <p>
        In natural language processing (NLP), a method called
"Semantic Folding (SF)" [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is important, as it makes it possible to
store text data as sparse distributed representations.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Semantic Folding</title>
      <p>Semantic folding theory creates Sparse Distributed
Representations (SDRs) of text data called semantic fingerprints
to emulate the structure of the neocortex, the area of the
brain that is responsible for several high-level cognitive
functions, such as vision, hearing, touch, movement or,
most relevant to our case: language. A single fingerprint
ideally represents contexts and clusters of contexts that are
present in a given text. Such a representation is achieved
by first gathering a corpus representative of the text that
we aim to encode, slicing it into snippets (sequences of
words, usually paragraphs) and arranging them into a 2D
array using self-organizing maps. As a consequence,
similar snippets (those that share a lot of words) end up close
together, forming clusters.</p>
      <p>After creating the array of our representation, we can
obtain the semantic fingerprint of a word by checking,
for every single context of the array, whether it contains the input
word or not. We set the given index to 1 in case the word
is present in the snippets of the context and to 0 otherwise.
The result is a sparse vector since most words only occur
in a handful of contexts, therefore it is preferable to store
only the indices that have a value of 1 to save memory.</p>
      <p>Since looking through every single context for each
input word is very time-consuming, it is preferable to
simply create a vocabulary of words from our corpus,
calculate the encodings for each word and store their
representation in a database. If we want to encode collections or
sequences of words, we simply take the fingerprint of each
individual word, count the active bits at every index
and then activate only the indices where the number of
active bits across all the fingerprints exceeds a certain threshold.</p>
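      <p>As a minimal Python sketch of this merging step (our own illustration, not the cortical.io implementation; the threshold value of 2 is a placeholder assumption):</p>
      <preformat>import numpy as np

FP_SIZE = 128 * 128  # fingerprint size used later in the paper

def merge_fingerprints(word_fps, threshold=2):
    """Merge per-word fingerprints (sets of active-bit indices) into one
    sparse fingerprint: a bit stays active only if at least `threshold`
    word fingerprints activate it, which keeps the result sparse."""
    counts = np.zeros(FP_SIZE, dtype=int)
    for fp in word_fps:
        counts[list(fp)] += 1  # count how many words activate each bit
    return set(np.flatnonzero(counts >= threshold))

# three toy word fingerprints sharing context bits 42 and 100
words = [{1, 42, 100}, {7, 42, 100}, {42, 3000}]
print(sorted(merge_fingerprints(words)))  # -> [42, 100]</preformat>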
      <p>Merging the fingerprints in such a way allows us to
retain sparsity and to prune from the representation all contexts
that may be relevant to the individual words but are
irrelevant in the context in which they occur. Such a representation
has a few notable advantages. First, every individual bit in
our representation has its specific individual meaning,
unlike in word encodings such as bag-of-words or ASCII codes.
Furthermore, comparing representations can be as easy as
counting the number of overlapping active bits. We can
also take advantage of the fact that similar contexts are
clustered and compare representations using metrics
reliant on geometry, such as the Euclidean and cosine distances.
While such a representation does not take word order into account,
and thus is unsuitable for language generation,
it is very much suitable for topic matching and preventing
semantic drift.</p>
    </sec>
    <sec id="sec-4">
      <title>Hierarchical Temporal Memory</title>
      <p>
        The human neocortex learns by recognizing patterns in
information sequences representing sensory "inputs" and
predicts likely following values based on previous
observations. Hierarchical Temporal Memory (HTM) is a
type of neural network that tries to reproduce the structure
and processes of the neocortex. More detailed description
of HTM is given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>HTM Network Structure</title>
        <p>An HTM network has an inherently bottom-up
hierarchical structure, as seen in Figure 1, composed of multiple
layers of 3-dimensional arrays of bits, referred to as cells.
Interconnected layers form a hierarchy where each layer is
connected to the one below it, with the one at the bottom
connecting to the input of the network itself. A layer itself
is a 3-dimensional array of bits comprised of columns
arranged in a 2D array topology, where a column is made up
of a single or multiple cells. Each column is connected to a
subset of the input/previous layer, and cells in a single
column are also connected to the cells above and below. The
hierarchy of layers is inspired by biology, with the
neocortex consisting of multiple regions that either receive their
input directly from sensory organs or from other regions
connected to them.</p>
        <p>
          The learning algorithm of HTM networks can be
divided into two steps [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]:
        </p>
        <p>Spatial Pooler (SP) creates a Sparse Distributed
Representation (SDR) based on the input and can be
viewed as a mapping function from the input domain
to a new feature domain where the meaning of the
input is preserved while ensuring that the
representation in the feature domain remains sparse. The
algorithm is a type of unsupervised competitive learning
algorithm that uses a form of vector quantization
resembling self-organizing maps (SOMs).</p>
        <p>
          Temporal Memory (TM) forms connections from the
cells active in the current step to cells that were active
just prior and makes predictions. The algorithm uses
Hebb’s rule, where connections are formed between
cells that were previously active. Through the
formation of those connections a sequence may be learned.
The TM can then use its learned knowledge of the
sequences to form predictions.
The explicit mathematical description of computations for
the HTM network can be found in Mnatzaganian et al.
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. It describes all aspects of the spatial pooler, a
critical learning component in HTM, under a single unifying
framework. The primary learning mechanism is explored,
where a maximum likelihood estimator for determining
the degree of permanence update is proposed.
        </p>
        <p>The HTM algorithm, in essence, allows us to take an array
of bits as an input and to predict the inputs of subsequent
steps. Such predictions may be enough for some tasks,
however, in order to detect anomalies, we need to extract
the occurring differences in the patterns presented in the
inputs.</p>
        <p>Let us define the vector xt as the input generated in our
system at time t. Then the sequence of inputs to the algorithm
can be defined as x1, …, xt−1, xt, xt+1, …, xk, …, possibly
continuing until the detection task is stopped manually.</p>
        <p>
          In general, the goal of streaming anomaly detection is
to find abnormalities in inputs as soon as they occur. Such
a real-time constraint means that for the purposes of
anomaly detection at time t, only the inputs of earlier times
(1, …, t−1) are accessible, not the input at time t + 1
or later. The HTM algorithm is capable of detecting such
anomalies as well by using a method described in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Let us define two variables to explain the HTM anomaly
detection algorithm. Let a(xt) be the sparse binary
representation of the input vector xt, defined by the binary 2D
matrix of all cells within a region, where ai,j(xt) is set to
1 if the ith cell of the jth column is in the active state and
to 0 otherwise. Let p(xt) be the sparse binary
representation of the prediction of the next input, a(xt+1), defined
by the binary 2D matrix of all cells within a region, where
pi,j(xt) is set to 1 if the ith cell of the jth column is in the
predictive state and to 0 otherwise. The values of the
prediction matrix are greatly influenced not just by the input
itself, but by the context as well.</p>
        <p>Therefore the accuracy of the algorithm prediction is
largely dependent on its ability to model the data. These
two variables are calculated at each step of the algorithm,
however, they do not contain sufficient information to find
anomalies by themselves. Instead, we use these variables
to compute a raw anomaly score for each timestamp,
labelled st . The raw anomaly score essentially gives us a
measurement of a deviation between the predicted and the
actual input. The raw anomaly score is given by:</p>
        <p>st = 1 − (p(xt−1) · a(xt)) / |a(xt)|</p>
        <p>Both variables are binary vectors; the multiplication is
the inner product, divided by the number of active bits of the input. The
less correct the prediction was, the larger our anomaly
score is. The value of st is therefore a scalar between
0 and 1, with 0 meaning the prediction was perfect and 1 meaning
nothing has been correctly predicted. A weak prediction
would therefore be indicative of an anomaly; however, it
does not take into account the predictive capability of our
network or the amount of noise present in the text.</p>
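        <p>A minimal Python sketch of this computation (our own illustration; variable names are ours), treating both representations as flattened binary vectors:</p>
        <preformat>import numpy as np

def raw_anomaly_score(p_prev, a_curr):
    """st = 1 - (p(xt-1) . a(xt)) / |a(xt)| for flattened binary matrices.

    p_prev: cells in the predictive state at the previous step
    a_curr: cells in the active state for the current input"""
    active = a_curr.sum()
    if active == 0:
        return 0.0                    # nothing active, nothing to score
    overlap = np.dot(p_prev, a_curr)  # correctly predicted active cells
    return 1.0 - overlap / active

p = np.array([1, 0, 1, 0])
a = np.array([1, 0, 1, 0])
print(raw_anomaly_score(p, a))        # 0.0 -> perfect prediction</preformat>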
        <p>To counteract this, we calculate the distribution of
anomaly scores within a certain time window, and
therefore find the likelihood instead of simply thresholding the
raw anomaly score. The anomaly likelihood metric is
designed to measure a change in predictability, rather than
change in the input pattern and thus it accounts for the
beginning of the text where predictability is low. The metric
is ideal for detecting not only the starting points of the
anomalies but their ending points as well (since in that
case the predictability would suddenly get much more
accurate).</p>
        <p>To calculate the anomaly likelihood metric, we use a
large moving window, represented by the vector W, that
stores the last k raw anomaly scores. In addition, we
define the window W′ with a size of j, which is much smaller
than k, that is used to calculate a small moving average
of the last few anomaly scores. It makes a more accurate
comparison metric than a single score. We calculate the
anomaly likelihood using the Q-function (the tail
distribution function of the Gaussian normal distribution),
where the mean μt and variance σt² of the raw anomaly scores
for the normal distribution function are recalculated every
time using the values of the anomaly scores in our
window-sized memory:</p>
        <p>μt = (st + st−1 + … + st−k+1) / k,   Q(x) = (1/√(2π)) ∫x∞ e^(−u²/2) du</p>
        <p>The anomaly likelihood metric at time t is defined as the
complement of the tail probability:</p>
        <p>Lt = 1 − Q((μ̃t − μt) / σt)   (1)</p>
        <p>where μ̃t is the small moving average of the raw anomaly
scores over the window W′.</p>
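        <p>A minimal Python sketch of the likelihood computation (our own illustration, not the reference implementation of [12]; the window sizes are placeholder values). It uses the identity Lt = 1 − Q(z) = Φ(z), where Φ is the Gaussian cumulative distribution function:</p>
        <preformat>import numpy as np
from scipy.stats import norm

def anomaly_likelihood(scores, k=100, j=5):
    """Sketch of Lt = 1 - Q((mu~t - mut) / sigmat).

    scores: raw anomaly scores s1..st, most recent last
    k: size of the long window W; j: size of the short window W'"""
    W = np.asarray(scores[-k:])    # long window of raw anomaly scores
    Wp = np.asarray(scores[-j:])   # short window W'
    mu, sigma = W.mean(), W.std()
    if sigma == 0.0:
        return 0.0                 # no variation: nothing is surprising
    # Q(z) = 1 - Phi(z), hence Lt = 1 - Q(z) = Phi(z)
    return norm.cdf((Wp.mean() - mu) / sigma)</preformat>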
        <p>The newly developed system combines the previously
described methods and works in two steps:
1. Finding the locations of the text changes. The first
step is to create an algorithm that can find the exact
locations where the style of the text changes, whether
from non-plagiarized to plagiarized or vice-versa. We
consider sentences the smallest unit of text to be
analyzed this way as we do not expect sentences that have
both anomalous and non-anomalous parts within them.
Therefore finding the locations of text changes in
practical terms means finding the offsets of the first
sentence of the anomalous part and the first sentence of
the non-anomalous part.</p>
      </sec>
      <sec id="sec-4-2">
        <title>2. Filtering out the non-anomalous potential sequences</title>
        <p>The second step is to take the offsets of the
first step and filter out the sections located between two
offsets that are not actually anomalous.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Finding the Locations of Text Changes</title>
        <p>We have created the following features for the purpose of
determining where such points of change lie:
F1. The anomaly likelihood score of each sentence.
F2. The cosine similarity of the Doc2Vec vectors and
their autoencoder predictions.</p>
        <p>F3. The Euclidean distance between the current
fingerprint and the fingerprint of preceding sentences.
F4. The cosine similarity between the current fingerprint
and the fingerprint of preceding sentences.</p>
        <p>F5. The Jaccard index between the current fingerprint
and the fingerprint of preceding sentences.</p>
        <p>F6. The average relational frequency of the words
contained within a sentence.</p>
        <p>F7. The lowest relational frequency of the words
contained within a sentence.</p>
        <p>F8. The highest relational frequency of the words
contained within a sentence.</p>
        <p>F9. The difference of the mean relational word frequency
of the sentence and the text.</p>
        <p>To calculate F1, we use the semantic fingerprint method
to create a fingerprint of each sentence, use the fingerprints as
inputs to the HTM network described in Section 4, and then
calculate the anomaly likelihood score.</p>
        <p>
          To calculate F2, we use Doc2Vec, an embedding
method described in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to create a vector representation
of each sentence of our text. We then use these vectors as inputs to
train an autoencoder, which first creates an encoding of lower
dimensionality and then reconstructs from the code an output
that best matches the input. After training the autoencoder,
we calculate the cosine similarity between the input
embedding and the reconstructed output embedding. Since
most of the text is expected to come from a single source
with occasional anomalous parts inserted, the network,
through training, creates generalized reconstructions that
have more success reconstructing the inputs from the
original text than from the anomalous parts.
        </p>
        <p>To calculate F3, F4 and F5, we use semantic
fingerprints, comparing the fingerprint of each sentence to the
merged fingerprint of the preceding 5 sentences in a
window (determined to be the average anomaly length). We
use 3 different metrics of comparison: Euclidean distance,
cosine similarity and the Jaccard index. These metrics are
suited to detecting the first sentence of an anomalous
section as well as the first sentence of a non-anomalous
section that follows an anomalous section.</p>
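        <p>A minimal Python sketch of the three comparison metrics over fingerprints stored as sets of active-bit indices (our own illustration; it assumes both fingerprints are non-empty):</p>
        <preformat>import numpy as np

def fingerprint_metrics(fp_a, fp_b):
    """Euclidean distance, cosine similarity and Jaccard index between two
    binary fingerprints given as sets of active-bit indices (non-empty)."""
    overlap = len(fp_a &amp; fp_b)
    union = len(fp_a | fp_b)
    # for binary vectors: ||a - b||^2 = |a| + |b| - 2 * overlap
    euclidean = np.sqrt(len(fp_a) + len(fp_b) - 2 * overlap)
    cosine = overlap / np.sqrt(len(fp_a) * len(fp_b))
    jaccard = overlap / union
    return euclidean, cosine, jaccard

print(fingerprint_metrics({1, 2, 3}, {2, 3, 4}))  # (1.41..., 0.66..., 0.5)</preformat>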
        <p>
          To calculate F6, F7 and F8, we use relational frequency
metrics described in [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], which measure how specific a
given word is to a given segment. We calculate the
frequency of the most and the least frequent word, as well as the
mean frequency of all words in a sentence. Anomalous
sentences are presumed to have low relational frequency
scores due to having words atypical of the rest of the text.
A low mean relational frequency means that a sentence
contains many unique words. A low highest relational
frequency means that unique stopwords are used. A low
lowest relational frequency means that a sentence-unique
word is present (which might not be, by itself, indicative of
an anomaly).
        </p>
        <p>
          To calculate F9, we use the mean relative frequency
of the author style metric described in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. We calculate
the frequency of each word in the sentence as well as in the
document. We average these values across all of the words
and compare the mean value of the sentence frequencies
and the document frequencies to get the difference in
author styles.
        </p>
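        <p>The following Python sketch illustrates one possible reading of F6-F9 (our interpretation, not the reference implementations of [18, 19]; sentences are assumed to be tokenized into non-empty word lists):</p>
        <preformat>from collections import Counter

def frequency_features(sentences):
    """Sketch of F6-F9 under our reading of [18, 19]: a word's 'frequency'
    is its relative frequency in the whole document.
    `sentences` is a list of token lists."""
    all_words = [w for s in sentences for w in s]
    total = len(all_words)
    rel = {w: c / total for w, c in Counter(all_words).items()}
    doc_mean = sum(rel[w] for w in all_words) / total  # document-level mean

    features = []
    for s in sentences:                 # assumes non-empty sentences
        fs = [rel[w] for w in s]
        mean = sum(fs) / len(fs)
        # (F6 mean, F7 lowest, F8 highest, F9 difference from document mean)
        features.append((mean, min(fs), max(fs), mean - doc_mean))
    return features</preformat>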
        <p>
          We use a Gradient Boosting Classifier (GBC), described
in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], to combine the predictive capability of the
aforementioned features as well as to evaluate their relative
importance. The GBC is an ensemble of multiple models,
specifically decision trees, which has superior prediction
capability compared to the individual models. It achieves this by
iteratively adding models that minimize the residual loss
obtained by the combined model in the previous step.
        </p>
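        <p>A minimal scikit-learn sketch of this step (our own illustration with placeholder data; the hyperparameters are the ones reported in Section 6):</p>
        <preformat>import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# X: one row of features F1-F9 per sentence; y: 1 for division points
# (the first sentence of an anomaly or the first sentence after it), else 0.
X = np.random.rand(1000, 9)        # placeholder feature matrix
y = np.random.randint(0, 2, 1000)  # placeholder labels

clf = GradientBoostingClassifier(n_estimators=200, max_depth=4)
clf.fit(X, y)

division_points = np.flatnonzero(clf.predict(X))  # predicted sentence offsets
print(clf.feature_importances_)    # relative importance of F1-F9</preformat>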
        <p>To label the dataset, we choose a rather simplistic
method: We label all sentences as zeroes except the first
sentence in each anomaly and the first sentence after the
anomaly ends. Our prediction of these division points is
then used to make the prediction about where the potential
anomalous sections might be.
</p>
      </sec>
      <sec id="sec-4-4">
        <title>Filtering out Non-anomalous Potential Sequences</title>
        <p>Now that we have our division points, we still need to
identify which sections lying between two points are
anomalous and which are not. If we merely tagged them in an
alternating fashion, it would lead to a large number of
errors, as a single faulty division point could mean that we
misclassify our entire dataset. We need some way of
determining the anomalous nature of individual sections. We
can assume that most anomalous sections are relatively
short compared to non-anomalous sections. We can pair
up indices of division points that are located close to one
another, specifically within 50 sentences of each other. While
we can create possible pairings of anomalous parts that
are located close to one another, there are individual
division points that cannot be paired. They might be false
positives, or they may correspond to a beginning or an ending
whose counterpart has not been found. For each isolated index, we
construct multiple artificial sections that are created at varying
distances before or after it. The exact process is described
in Algorithm 1.</p>
        <p>Such pairings inevitably lead to false positives,
therefore it is vital to devise a method that can recognize
genuine anomalous parts.</p>
        <p>We use a gradient boosting regressor to predict the
anomalousness of a potential section (the percentage of
anomalous sentences in the section).</p>
        <p>There are multiple parameters that we use as inputs,
some from the previous step and others from modified
algorithms that deal with sections instead of sentences.
Input values for the regressor are the following:</p>
        <preformat>Algorithm 1: Algorithm for creating potential anomalous sections
            from division points
Data: P - sorted list of division points, where the value of each
      point is the index of the sentence within the document
Result: S - list of potential anomalous sections
i ← 1
while i &lt; length(P) do
    if P[i] − P[i−1] ≤ 50 then
        S.addSection(P[i−1], P[i])
    else if P[i+1] − P[i] ≤ 50 then
        S.addSection(P[i], P[i+1])
    else
        for j in range(1, 10) do
            S.addSection(P[i] − 5j, P[i])
            S.addSection(P[i], P[i] + 5j)
        end
    end
    i ← i + 1
end</preformat>
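        <p>A direct Python transcription of Algorithm 1 (a sketch; sections are represented as (first, last) tuples, and the gap of 50 sentences and step of 5 follow the text):</p>
        <preformat>def potential_sections(P, max_gap=50, steps=10, stride=5):
    """Create potential anomalous sections from sorted division points P.

    Sections are (first_sentence, last_sentence) tuples."""
    S = []
    for i in range(1, len(P)):
        if P[i] - P[i - 1] &lt;= max_gap:
            S.append((P[i - 1], P[i]))      # pair with the previous point
        elif i + 1 &lt; len(P) and P[i + 1] - P[i] &lt;= max_gap:
            S.append((P[i], P[i + 1]))      # pair with the next point
        else:
            # isolated point: build artificial sections around it
            for j in range(1, steps):
                S.append((P[i] - stride * j, P[i]))
                S.append((P[i], P[i] + stride * j))
    return S

print(potential_sections([3, 10, 200]))  # [(3, 10), ...] plus artificial ones</preformat>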
        <p>The mean value of F2 and F6-F9 over every sentence of
the section (metrics using Semantic Folding are not
used, due to them having high values only at division
points).</p>
        <p>The Euclidean distance between the fingerprint of the
section and the fingerprint of the entire document.
The cosine similarity between the fingerprint of the
section and the fingerprint of the entire document.
The Jaccard index between the fingerprint of the
section and the fingerprint of the entire document.
The average relational frequency of the words
contained within the entire section.</p>
        <p>The lowest relational frequency of the words
contained within the entire section.</p>
        <p>The highest relational frequency of the words
contained within the entire section.</p>
        <p>The difference of the mean relational word frequency
of the section and the text.</p>
        <p>After training our regressor, we threshold the
anomalousness values to obtain the list of anomalous sections. Due
to the artificial creation of sections, overlaps may
occur. We describe a way to merge these sections in
Algorithm 2.</p>
        <preformat>Algorithm 2: Algorithm for merging overlapping sections
Data: S - list of potential anomalous sections, each defined by the
      index of its starting sentence and the index of its last sentence
Result: S - list of potential anomalous sections without overlaps
OverlappingSections ← TRUE
while OverlappingSections do
    OverlappingSections ← FALSE
    for i in range(0, length(S) − 1) do
        for j in range(i + 1, length(S)) do
            if overlaps(S[i], S[j]) then
                OverlappingSections ← TRUE
                first ← min(S[i].first, S[j].first)
                last ← max(S[i].last, S[j].last)
                S.addSection(first, last)
                S.removeSection(S[i])
                S.removeSection(S[j])
            end
        end
    end
end</preformat>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <sec id="sec-5-1">
        <title>Data preparation</title>
        <p>
          We worked with the PAN intrinsic anomaly detection
plagiarism corpus 2011 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] as experimental text data. It is
a collection of 4753 larger text bodies in the English
language, made up of various topics, where some texts have
plagiarised parts artificially inserted into them. We have
chosen 40 texts from the corpus to test our system.
        </p>
        <p>For each text, we have two files: a .txt file containing
the text itself and an .xml file containing various
metadata, such as the source of the base text and the author,
and most importantly the list of anomalous parts, defined
by their division points and lengths. The
number of plagiarized parts in a text is not constant and can
range from 0, with no plagiarism inserted, up to 10 anomalous
parts. Anomalous parts are whole sentences or entire
paragraphs. Thus we only consider entire sentences or larger
collections of sentences as possible anomalies.</p>
        <p>First we remove all stop words from the text, such as a,
the, is, at, which, on, etc., since these words have little
relevance to the meaning or authorial style of the text.
We solved two tasks as follows:</p>
        <p>Task 1: To find out if the suspicious text is anomalous:
Here, we are only trying to determine whether a text
contains any anomalies or not, ignoring the location or the
number of anomalies present.</p>
        <p>Task 2: To identify the individual anomalous parts
within the text correctly, taking into consideration their
precise locations.</p>
        <p>
          We first split the sentences using the NLTK (Natural
Language Toolkit) tokenizer, described in [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], marking
certain words ending with a punctuation mark as
exceptions when splitting (such as st., mr., mrs., dr., etc.). We
also removed all non-text characters, all stop words, as
well as words shorter than 3 letters. Finally, we also
merged sentences that consist of only a single word with
the first non-single-word sentence that comes after them. Such
merging allows us to avoid many false positives
which would occur from sentences that have very
little meaning on their own. We also store which sentences
they were merged from and apply the calculated features
to all of those sentences.
        </p>
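        <p>A minimal sketch of this splitting step using NLTK's Punkt tokenizer (our own illustration; the abbreviation set is the one mentioned above):</p>
        <preformat>from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

# Abbreviations that should not end a sentence, as mentioned above.
params = PunktParameters()
params.abbrev_types = {"st", "mr", "mrs", "dr"}
tokenizer = PunktSentenceTokenizer(params)

text = "Mr. Smith went to Washington. He met Dr. Jones there."
print(tokenizer.tokenize(text))
# ['Mr. Smith went to Washington.', 'He met Dr. Jones there.']</preformat>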
        <p>We then calculate the features for each sentence. For
the merged fingerprints of the preceding sentences used
for F1, F3, F4 and F5, we have chosen to merge the
fingerprints of five sentences that came before the
current sentence. In the case of the first sentence, we do not
have a merged fingerprint, thus we use the current
fingerprint as the input to the Temporal Pooler as is. In
cases where fewer than 5 sentences are available, we
use a merged fingerprint consisting of the sentences that are
available. Then, in the case of F1, as was described, we create
a fingerprint where only those bits have a value of 1 that
are active in both the merged print and the current print
for each sentence. The fingerprint arrays have a size of
128 × 128 and use the standard English associative
dictionary of cortical.io. As the fingerprints are encoded as
a collection of active bits, we have a list of sorted indices
that are between 1 and 16384.</p>
        <p>To increase computational efficiency, we split the list
into four parts, each holding a portion of the values in the
following way: the first list contains all of the values
between 1 and 4096, the second between 4097 and 8192,
the third between 8193 and 12288 and the fourth between
12289 and 16384. We then create sparse vectors from
these lists by putting a 1 at each index that is present in the
list and 0 everywhere else. Such a division does not pose
much of a problem for the prediction capabilities of the HTM
algorithm, as it has a feature that allows it to separate
individual patterns that represent a sequence of inputs that
still belongs to a single object.</p>
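        <p>A minimal Python sketch of this splitting scheme (our own illustration; indices are 1-based, as described above):</p>
        <preformat>import numpy as np

def split_fingerprint(active_indices, size=16384, parts=4):
    """Split a fingerprint (sorted 1-based active indices) into `parts`
    dense binary vectors covering consecutive index ranges."""
    width = size // parts                      # 4096 indices per part
    vectors = np.zeros((parts, width), dtype=np.uint8)
    for idx in active_indices:
        part, offset = divmod(idx - 1, width)  # convert to 0-based
        vectors[part, offset] = 1
    return vectors

v = split_fingerprint([1, 4096, 4097, 12289, 16384])
print(v.sum(axis=1))  # [2 1 0 2] -> active bits per quarter</preformat>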
      </sec>
      <sec id="sec-5-2">
        <title>HTM Network and Gradient Boosting Classifier</title>
      </sec>
      <sec id="sec-5-3">
        <title>Training</title>
        <p>We first fit the HTM network using the entire text once as
the training set and then run the same inputs through the
network to get our predictions. We calculated the anomaly
scores and from them the likelihood scores for each
sentence of the document. For the moving window of mean
anomaly scores, we used a window size of 5 that stores the
previous 5 anomaly scores. For F2, we have used Doc2Vec
to create 100 dimensional vectors from each sentence. We
have chosen a five-layer feed-forward perceptron as our
autoencoder network, with the (experimentally chosen)
layer sizes 100, 50, 20, 50 and 100, the first and
last layers being the same size as the Doc2Vec inputs. We
trained the network on the text for 5 epochs
(experimentally chosen), then performed the reconstruction for each
sentence and calculated the cosine similarity between the
input and the reconstruction.</p>
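        <p>A minimal PyTorch sketch of such an autoencoder (our own illustration; the activation functions, optimizer and loss are assumptions, as the text does not specify them):</p>
        <preformat>import torch
import torch.nn as nn

# Five-layer feed-forward autoencoder with layer sizes 100-50-20-50-100,
# matching the sizes given in the text.
model = nn.Sequential(
    nn.Linear(100, 50), nn.ReLU(),
    nn.Linear(50, 20), nn.ReLU(),   # bottleneck code
    nn.Linear(20, 50), nn.ReLU(),
    nn.Linear(50, 100),
)
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

X = torch.randn(200, 100)           # placeholder Doc2Vec sentence vectors
for epoch in range(5):              # 5 training epochs, as in the text
    optimizer.zero_grad()
    loss = loss_fn(model(X), X)     # reconstruct the input
    loss.backward()
    optimizer.step()

# F2: cosine similarity between each input and its reconstruction
with torch.no_grad():
    f2 = nn.functional.cosine_similarity(model(X), X, dim=1)</preformat>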
        <p>Features F3-F5 are calculated just as described in
Section 5, using the five sentences preceding the current
sentence to form our merged fingerprint and comparing it to
the fingerprint of the current sentence. Features F6-F9 are
calculated just as described in Section 5, with calculating
the relational frequencies and author style separately for
every sentence on a text that did not have short words or
stop words removed.</p>
        <p>We used these features as the input parameters to our
Gradient Boosting Classifier with 200 estimators and a
maximum depth of 4. After training our classifier, we can
get the positive examples of sentence offsets and construct
the potential sections from them. To filter them, we
train a Gradient Boosting regressor with 200 estimators
and a depth of 4 to predict their relevance. We then
threshold the prediction to get the potential anomalous parts.
We experimented with multiple thresholds and found a
threshold value of 0.7 to be sufficient. Finally, we merge the
potential sections and then compare them with the actual
anomalous sections using multiple metrics, such as
precision, recall and accuracy to measure our success. For
evaluation, we consider the anomalies as positive
examples and label every single character of the full unmodified
text, therefore taking into account the length of sentences
as well.
</p>
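        <p>A minimal scikit-learn sketch of the filtering step (our own illustration with placeholder data; the estimator count, depth and the 0.7 threshold follow the text):</p>
        <preformat>import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# X_sec: one row of section-level features per potential section;
# y_sec: fraction of anomalous sentences in the section (training target).
X_sec = np.random.rand(300, 8)  # placeholder section features
y_sec = np.random.rand(300)     # placeholder anomalousness values

reg = GradientBoostingRegressor(n_estimators=200, max_depth=4)
reg.fit(X_sec, y_sec)

THRESHOLD = 0.7                 # threshold found sufficient in our experiments
anomalous = np.flatnonzero(reg.predict(X_sec) >= THRESHOLD)</preformat>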
      </sec>
      <sec id="sec-5-4">
        <title>Experimental results</title>
        <p>
          We have evaluated the predictions of anomalies made by
our model and organized the results into Table 1. For
comparison, we have implemented the Author style algorithm
proposed by Kuznetsov et al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] and evaluated it on our
dataset as well. The results have also been organized into
Table 1, alongside the results of our own algorithm.
        </p>
        <p>
          The columns of the table are arranged in the following
manner: The Txt column corresponds to the id number of
the text. The Plag det column tells us how many anomalies
our algorithm detected fully out of the number of
plagiarisms present. The T1 (Task 1) column tells us the
conclusion our algorithm reached when determining whether
the text is anomalous or not (N for non anomalous, Y for
anomalous). Columns Pre, Rec and Acc refer respectively
to the precision, recall and accuracy values achieved by
our algorithm with classifying anomalies. Columns Pre
A2, Rec A2 and Acc A2 are the results achieved by [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. If
there are no predicted anomalous parts/sentences, the
precision metric has a value of N/A, since calculating the
metric becomes meaningless. Likewise, if there are no actual
anomalous parts/sentences present in the text, the recall
metric similarly has a value of N/A.
        </p>
        <p>As we can see in Table 1, our algorithm has achieved
an accuracy of 100 % in T1. Regarding T2, the overall
percentage of fully found plagiarisms is 70.15 %. The
results show high accuracy due to the far greater number of
non-anomalous sections in the actual text, as well as high
precision due to the number of false positives being
further reduced by filtration. The downside of such extensive filtering
can be seen in the recall values, however, where the results
vary. Such a discrepancy between the precision and
recall values can be observed with both algorithms. When it
comes to accuracy, the results of the author style algorithm
have only been better in the case of text 11, equal in multiple
cases and inferior to ours in most.</p>
        <p>Looking at precision and recall, the author style algorithm
tends to achieve a better, sometimes perfect precision score,
but this usually comes at the cost of significantly worse
recall and accuracy scores. While precision is important,
we believe that for recommender systems, it is better to
create a few false positives in order to flag most of the
anomalous sections, rather than being more sure with the
anomalousness of fewer sections.</p>
        <p>As for the performance of the author style algorithm
on T1, it failed to achieve 100 % accuracy, due to often
predicting sentences to be anomalous when there are no
anomalies to be found (such as texts 5 or 10 in Table 1)
or not finding any anomalies despite the text containing
them (such as texts 23 or 28). It appears that our results
mostly surpass the results generated by the author style
algorithm, therefore showing that sequential anomaly
detection is worth consideration.</p>
        <p>However, the disparity in precision is still not ideal, and
lowering our threshold did not give us much improvement
in recall compared to the decrease in precision. We
believe this occurs because of short anomalous
sections: our artificially constructed anomalous
sections are at least 5 sentences long, which might be more
than the minimum length of said anomalies. We can see
that a lot depends on the division point detection part
of the algorithm, therefore we have decided to plot the
importance of features for our classifier, shown in Figure 4.</p>
        <p>As we can see, the most successful feature used was F9,
the mean relational word frequency comparison between
the whole text and the sentence. The anomaly likelihood
is very close, being designed specifically to detect points
of change. We can see that using Semantic Folding
without a strong predictor like HTM decreases the performance
as evidenced by the relative uselessness of the fingerprint
comparison metrics (Euclidean, Cosine and Jaccard).</p>
        <p>The metrics that determine the most unique sentences
instead of division points are also quite important in the
prediction as demonstrated by the usefulness of the cosine
similarity of the Doc2Vec vectors as well as the mean
relational frequency of words. In the case of the lowest-frequency
metric, there may easily be sentence-unique words in
sentences that are from non-anomalous parts, and sentences
from anomalous parts that do not have sentence-unique
words. As for the highest frequency, while some
anomalous sentences might not have the same stop words as the
rest of the sentences, others might, making it not all that
reliable.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We have developed a method capable of detecting
intrinsic anomalies in natural text based on sequential analysis
of the syntactic and semantic patterns exhibited by
potentially anomalous text. The algorithm consists of the two
step process of identifying the location of the starting and
ending sentences of anomalies and determining whether a
section located between two such points is anomalous. For
both of these steps we use gradient boosting to form a
prediction out of various metrics extracted from the text using
the Semantic Folding algorithm, HTM network, Doc2Vec,
autoencoders and metrics based on word frequencies.</p>
      <p>
        The algorithm was tested and evaluated on 40 English
texts with artificially inserted plagiarisms from the PAN
intrinsic anomaly detection plagiarism corpus 2011. Our
objective was twofold, the first, to determine whether the
text contains any anomalies at all and the second, to
determine the exact number and position of the anomalous
passages. The algorithm achieved an accuracy of 100 % in
Task 1 and better-than-expected results in Task 2,
depending on the text, with relatively high values of precision
and accuracy but varying recall. The overall percentage of
fully found plagiarisms within the text was 70.15 %. We
found that our algorithm was able to improve on the
results of Kuznetsov et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. While there is much room
for improvement, we believe the paradigm of continuous
analysis in plagiarism detection to be an interesting and
valid one that provides results that cannot be dismissed and
might be a step in the right direction when solving similar
problems.
      </p>
      <p>Acknowledgements. The research is supported by
the Slovak Scientific Grant Agency VEGA, Grant No.
1/0177/21 “Descriptional and Computational Complexity
of Automata and Algorithms”.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>PAN Plagiarism Corpus 2011 (PAN-PC-11)</article-title>
          , DOI 10.5281/zenodo.3250095, (
          <year>2011</year>
          ) http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-11/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Almarimi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrejková</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Anomaly Searching in Text Sequences</article-title>
          ,
          <source>CEUR Workshop Proceedings, ISSN 1613-0073</source>
          , Vol-
          <volume>2046</volume>
          urn:nbn:de:
          <fpage>0074</fpage>
          -
          <lpage>2046</lpage>
          <source>-8, Proceedings of the 11th Joint Conference on Mathematics and Computer Science Eger, Hungary, May 20-22</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mikolov</surname>
          </string-name>
          , T.:
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>Proceedings of the 31st International Conference on Machine Learning</source>
          , PMLR
          <volume>32</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>De</given-names>
            <surname>Sousa Weber</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Semantic Folding Theory and its Application in Semantic Fingerprinting</article-title>
          .
          <source>White paper, Version</source>
          <volume>1</volume>
          .
          <issue>2</issue>
          ,
          <fpage>1</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Purdy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lavin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Biological and Machine Intelligence (BAMI)</article-title>
          , http://numenta.com/business-strategy-and-ip/,
          <source>Release 0.4</source>
          , (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          :
          <article-title>Outlier analysis</article-title>
          . Springer Science+Business Media New York, (
          <year>2013</year>
          ). https://doi.org/10.1007/978-3-319-47578-3
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Chandola</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          , V.:
          <article-title>Anomaly detection: A survey</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ), (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Zhuang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
          </string-name>
          , Ch.,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaplan</surname>
          </string-name>
          , L. and
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J</given-names>
          </string-name>
          .:
          <source>Identifying Semantically Deviating Outlier Documents. Proceedings of the conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2748</fpage>
          -
          <lpage>2757</lpage>
          , Copenhagen, Denmark, September 7-
          <issue>11</issue>
          ,
          <year>2017</year>
          . c 2017 Association for Computational Linguistics
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Kannan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Woo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aggarwal</surname>
            ,
            <given-names>C. C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Park</surname>
          </string-name>
          , H.:
          <article-title>Outlier Detection for Text Data: An Extended Version</article-title>
          . arXiv:
          <volume>1701</volume>
          .01325v1, (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Why neurons have thousands of synapses, a theory of sequence memory in neocortex</article-title>
          .
          <source>Frontiers in Neural Circuits</source>
          <volume>10</volume>
          (
          <issue>23</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klukas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Purdy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A framework for intelligence and cortical function based on grid cells in the neocortex</article-title>
          .
          <source>Frontiers in Neural Circuits</source>
          , Vol.
          <volume>12</volume>
          , Article 121, (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S. F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Purdy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Real-Time Anomaly Detection for Streaming Analytics</article-title>
          . arXiv:
          <volume>1607</volume>
          .02480v1, (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S</given-names>
          </string-name>
          , Lavin,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Purdy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Agha</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          :
          <article-title>Unsupervised real-time anomaly detection for streaming data</article-title>
          .
          <source>Neurocomputing</source>
          ,
          <volume>267</volume>
          (
          <year>2017</year>
          ),
          <fpage>134</fpage>
          -
          <lpage>147</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Surpur</surname>
          </string-name>
          , Ch.,
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A comparative study of HTM and other neural network models for online sequence learning with streaming data</article-title>
          ,
          <source>2016 International Joint Conference on Neural Networks (IJCNN)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1530</fpage>
          -
          <lpage>1538</lpage>
          , doi: 10.1109/IJCNN.
          <year>2016</year>
          .772738
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Hole</surname>
            ,
            <given-names>K. J.</given-names>
          </string-name>
          :
          <article-title>Anomaly Detection with HTM</article-title>
          .
          Chapter 12 in:
          <source>Anti-fragile ICT Systems</source>
          , Springer, 2016
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Young</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hazarika</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poria</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
          </string-name>
          , E.:
          <source>Recent Trends in Deep Learning Based Natural Language Processing</source>
          .
          <year>2018</year>
          , arXiv:
          <fpage>1708</fpage>
          .02709v8 [cs.CL]
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Mnatzaganian</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fokoué</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kudithipudi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A Mathematical Formalization of Hierarchical Temporal Memory's Spatial Pooler</article-title>
          . arXiv:
          <volume>1601</volume>
          .06116v3, (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Oberreuter</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>L'Huillier</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ríos</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velásquez</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          :
          <article-title>Approaches for Intrinsic and External Plagiarism Detection</article-title>
          ,
          <source>Notebook for PAN at CLEF</source>
          <year>2011</year>
          , (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motrenko</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsova</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Strijov</surname>
          </string-name>
          , V.:
          <article-title>Methods for intrinsic plagiarism detection and author diarization</article-title>
          ,
          <source>Notebook for PAN at CLEF</source>
          <year>2016</year>
          , (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Natekin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Knoll</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Gradient boosting machines, a tutorial</article-title>
          .
          <source>Frontiers in Neurorobotics</source>
          , December
          <year>2013</year>
          , Volume
          <volume>7</volume>
          , Article 21, DOI 10.3389/fnbot.2013.00021, (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>NLTK: The Natural Language Toolkit</article-title>
          . CoRR, cs.CL/0205028., (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>