<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The problem of fuzzy duplicate detection of large texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E V Sharapova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R V Sharapov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vladimir State University</institution>
          ,
          <addr-line>Orlovskaya street 23, Murom, Russia, 602264</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>270</fpage>
      <lpage>277</lpage>
      <abstract>
<p>The paper considers the problem of fuzzy duplicate detection. The basic approaches to the detection of text duplicates are given: distance between strings, fuzzy search algorithms without data indexing, and fuzzy search algorithms with data indexing. A review of existing methods for fuzzy duplicate detection is given, and an algorithm of fuzzy duplicate detection is presented. The algorithm of fuzzy duplicate texts detection was implemented in the system AVTOR.NET. The paper presents the results of testing the system. The use of text filtering, stemming and replacement-character processing allows the algorithm to find duplicates even in slightly modified texts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Today, computer technology and the World Wide Web have become part of our lives. Billions of users
use the Internet every day, and the Internet contains a huge number of documents. Many books are
transformed into digital form, and users can read books, articles and newspapers directly from computers and
electronic gadgets. The Internet also offers many study materials: lectures, textbooks,
methodical literature, etc. There are big collections of essays, course and diploma projects and
scientific dissertations. This leads to the problem of uncontrolled copying of information.</p>
<p>The text of a document may be a part of another document, may be formed from several documents, or may be a
modified part of another document (with words changed to synonyms, changed endings, tenses, etc.). In this
case, the document is a fuzzy duplicate of other documents.</p>
<p>Thus there is the problem of checking a text for the presence of fuzzy duplicates. The purpose of this work is the
creation of an algorithm of fuzzy duplicate detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methods of the duplicate texts detection</title>
      <sec id="sec-2-1">
        <title>2.1. The distance between strings</title>
<p>The task of comparing texts is reduced to comparing their strings. For this reason, assume that the text
document is one string of great length. We need a measure of
string similarity; this measure is called the distance between strings.</p>
        <p>
The distance is the minimum number of character edits necessary for the
transformation of one string into another. The Hamming and Levenshtein algorithms are used to
determine the distance [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          (a) The algorithm of Hamming [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>The Hamming algorithm computes the distance between strings of equal length using only
the "replacement" operation. It is therefore not suitable for strings
of different lengths.</p>
        <p>
          (b) The algorithm of Levenshtein [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
        </p>
<p>The Levenshtein algorithm uses the operations "replacement", "insert" and "delete", which allow it to
measure the distance between strings of different lengths. However, the calculation time
grows disproportionately with string size. Therefore, this
algorithm is only suitable for comparing documents of a few pages.</p>
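<p>As an illustration (a standard textbook implementation, not the paper's code), the Levenshtein distance can be computed with dynamic programming in O(p∙b) time, which explains the growth of the calculation time on long strings:</p>

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with the operations replacement, insert, delete."""
    prev = list(range(len(b) + 1))          # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]                           # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # replace ca with cb
        prev = cur
    return prev[-1]
```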
        <p>
          (c) The algorithm of Damerau-Levenshtein [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
<p>This is a modification of the Levenshtein distance. It also considers the operation of "transposition" of
two adjacent letters. The method of calculating the Levenshtein distance became the basis for several
algorithms of search for common subsequences.</p>
        <p>
          (d) The algorithm of Wagner and Fisher [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
<p>This method is based on calculating the Levenshtein distance between string prefixes (substrings).
A matrix of edit operations (an edit script) is built. It contains the accumulated values of the Levenshtein distance
(the minimum weight of operations to change characters). The size of the edit script matrix is
(p+1)∙(b+1), where p and b are the lengths of the compared string prefixes.</p>
        <p>
          The number of string comparisons is k∙p∙b, where k is a coefficient (for natural language, k = 0.2). The
complexity of the algorithm is O(p∙b) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. This method is the simplest way to construct an edit script.
(e) The algorithm of Masek and Paterson [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        </p>
<p>The algorithm is a modification of the Wagner and Fisher algorithm that applies the approach of
Arlazarov, Dinic, Kronrod and Faradzhev. The distance matrix is partitioned into submatrices whose
edges are computed from the adjacent submatrices.</p>
        <p>The complexity of algorithm is О(k∙(p∙b/log(p))).</p>
        <p>
          This algorithm is faster than the algorithm of Wagner and Fischer but, as the authors
themselves note [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], it reaches real speed only when comparing very long strings.
        </p>
        <p>
          (f) The algorithm of Ukkonen [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
<p>This algorithm requires the construction of a suffix tree over the set of strings to minimize substring
search time.</p>
        <p>The complexity of the algorithm is O(k∙m), where m is the maximum vertex depth of the suffix tree.</p>
        <p>The Ukkonen algorithm is better suited to searching for exact string matches; on very
different texts, its working time greatly increases.</p>
        <p>
          (g) The algorithm of Hirschberg [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
<p>The algorithm is a modification of the Wagner and Fisher algorithm. It calculates not the distance
between the strings but the longest common subsequence.</p>
        <p>
The complexity of the algorithm is O(k∙(p+b)).
(h) The algorithm of Hunt and Szymanski [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
<p>It is based on finding a maximum increasing path in a graph of matching elements of the compared
strings.</p>
        <p>The complexity of the algorithm is O(k∙(g+b)∙log(b)), where g is the number of positions in which the
string symbols are equal.</p>
        <p>Under certain circumstances, this algorithm shows good results, but in other cases, the comparison
time is quadratic.</p>
        <p>
          (i) The suffix tree [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
<p>This method was proposed by McCreight in 1976. It is similar to the Ukkonen algorithm, but the
suffixes are added in reverse order.</p>
<p>The complexity of the algorithm is O(k∙m).</p>
        <p>A disadvantage of algorithms working with suffix trees is the use of a fixed alphabet. They are suitable only for
searching for exact string matches.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Fuzzy search algorithms without indexing data</title>
        <p>
          Let’s consider fuzzy search algorithms without data indexing [
          <xref ref-type="bibr" rid="ref12">12</xref>
]. They are used to search in texts that are
not known or indexed in advance.
        </p>
        <p>
          (a) Linear search algorithm [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
<p>The algorithm applies a distance metric to the words of the text. The efficiency of the method
depends on the number of errors and mismatches between the texts: the more numerous they are, the longer
the comparison takes.</p>
<p>The algorithm complexity is O(s∙p), where s is the number of errors allowed when checking and p is the
text length.</p>
        <p>
          (b) The Bitap algorithm [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
<p>This algorithm is used more often than linear search. In its modified forms it calculates the
Levenshtein distance between words. Under normal conditions, the speeds of the two algorithms are the
same; the complexity is also O(s∙p), but on long words this algorithm is significantly faster
than the linear method.</p>
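<p>A sketch of the zero-error (exact matching) core of Bitap, which encodes the search state as a bit mask; the fuzzy variants extend this scheme with one extra mask per allowed error. The function name is ours:</p>

```python
def bitap_find(text: str, pattern: str) -> int:
    """Shift-and Bitap: index of the first exact occurrence of
    pattern in text, or -1 if absent."""
    m = len(pattern)
    if m == 0:
        return 0
    # For each character c, a mask with bit i set iff pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    state = 0
    for j, c in enumerate(text):
        # Bit i of state: pattern[:i+1] matches the text ending at j.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & (1 << (m - 1)):
            return j - m + 1
    return -1
```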
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Fuzzy search algorithms with indexing data</title>
<p>To apply this group of algorithms, the text must be indexed. A dictionary is
constructed from the text; it contains words and their positions in the text. On the basis of the dictionary, the
index required for further search is created.</p>
        <p>(a) The algorithm of sample expansion.</p>
<p>This algorithm generates the set of possible erroneous variants of a word and then looks them up in the
dictionary.</p>
<p>The complexity of the algorithm is O((b∙A)<sup>s</sup>∙b∙log p), where A is the dictionary size, b and p are the
compared words, and s is the number of errors in a word.</p>
<p>The disadvantage of the algorithm is that as the number of errors grows, its running time
also increases.</p>
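<p>A minimal sketch of sample expansion for one error, assuming a fixed alphabet; each generated variant is then looked up in the dictionary. The exponential growth of the variant set with s is what makes the method slow for many errors:</p>

```python
def expand_sample(word: str, alphabet: str) -> set:
    """All variants of `word` within one edit operation
    (deletion, substitution, insertion)."""
    variants = set()
    for i in range(len(word)):
        variants.add(word[:i] + word[i + 1:])            # deletion
        for c in alphabet:
            variants.add(word[:i] + c + word[i + 1:])    # substitution
    for i in range(len(word) + 1):
        for c in alphabet:
            variants.add(word[:i] + c + word[i:])        # insertion
    return variants
```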
        <p>
          (b) The N-gram algorithm [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
<p>This algorithm is based on separating words into parts (substrings) of length N (N-grams). An
N-gram list is compiled, containing for each substring all the words of the text in which it occurs. The query is
separated similarly and searched against this list. Division into trigrams is used most often.
The complexity of the algorithm is O(w∙H), where w is the probability of occurrence of an N-gram in a word and H
is the number of words in the N-gram list.</p>
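<p>A sketch of trigram indexing as described above; the identifier names are ours:</p>

```python
def ngrams(word: str, n: int = 3) -> list:
    """Overlapping substrings of length n (the word itself if shorter)."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_ngram_index(words) -> dict:
    """Map each n-gram to the set of words that contain it."""
    index = {}
    for w in words:
        for g in ngrams(w):
            index.setdefault(g, set()).add(w)
    return index
```

<p>A query word is split the same way, and candidate words are collected from the lists of its n-grams.</p>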
        <p>
          (c) Hashing by signature [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
<p>This algorithm represents each word of the text as a code ("hash") consisting of digits. The indexed words are
recorded in a table. The query is indexed in the same way, and the search is performed in the table.</p>
<p>The complexity of the algorithm is O((E<sup>s</sup>/2<sup>E</sup>)∙p), where E is the size of the hash code
representing the word, p is the text length, and s is the number of errors.</p>
        <p>
          (d) The shingles algorithm [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
<p>The source text is separated into a set of sequences of a defined length (shingles). Documents are
similar if some number of their shingles are equal. Since the number of shingles is rather large,
methods of reducing the shingle set are used.</p>
        <p>For example, only those shingles whose fingerprint is divisible by a defined
number n can be considered. Alternatively, the shingles with the minimal fingerprints are selected, or simply a defined
number of shingles is taken.</p>
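<p>The two reduction strategies can be sketched as follows; MD5 stands in here for the unspecified fingerprint function:</p>

```python
import hashlib

def shingles(words, k):
    """All sequences of k consecutive words, joined into strings."""
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]

def fingerprint(shingle: str) -> int:
    return int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16)

def select_divisible(shingle_list, n):
    """Keep only shingles whose fingerprint is divisible by n."""
    return [s for s in shingle_list if fingerprint(s) % n == 0]

def select_minimal(shingle_list, count):
    """Keep the `count` shingles with the minimal fingerprints."""
    return sorted(shingle_list, key=fingerprint)[:count]
```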
        <p>
          (e) The algorithm of D. Fetterly [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
<p>For each document, 84 fingerprints are computed by the Karp-Rabin algorithm over a random
sample of shingles. These fingerprints are separated into 6 groups called "super shingles". A
combination of 2 super shingles is a "mega shingle"; 15 mega shingles are computed for each
document. Two documents are similar if they have at least one mega shingle in common.</p>
<p>The algorithm complexity is O(S1∙S2), where S1 is the number of shingles of document 1 and S2 is the number
of shingles of document 2.</p>
        <p>
          (f) The I-Match algorithm [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
<p>A dictionary A is built for a group of documents. The dictionary consists of words with average
IDF values. For each document, a dictionary B is created and the intersection of A and B is determined.
The words of the document included in the intersection are used to build the I-Match signature of the document.
If the I-Match signatures of two documents are equal, the texts are similar.</p>
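<p>A simplified sketch of the I-Match signature (the construction of dictionary A from average IDF values is omitted; here A is simply a set of words):</p>

```python
import hashlib

def imatch_signature(doc_words, dictionary_a) -> str:
    """Hash of the document words that occur in dictionary A.
    Equal signatures indicate similar texts."""
    common = sorted(set(doc_words) & dictionary_a)
    return hashlib.md5(" ".join(common).encode("utf-8")).hexdigest()
```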
        <p>
          (g) The algorithm of key words [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
<p>"Key" words with a threshold frequency are selected from the index. For each document a
binary vector (signature) is constructed. The i-th coordinate of the vector is 1 if the frequency of the i-th
"key" word in the text is above the threshold value, and 0 if its frequency is below the
threshold. Texts are similar if they have the same signature.</p>
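<p>The signature construction can be sketched as follows (the key-word list and the threshold are inputs; the names are ours):</p>

```python
def keyword_signature(text_words, key_words, threshold) -> tuple:
    """Binary vector: coordinate i is 1 iff the relative frequency of
    key word i in the text exceeds the threshold."""
    counts = {}
    for w in text_words:
        counts[w] = counts.get(w, 0) + 1
    total = len(text_words)
    return tuple(1 if counts.get(k, 0) / total > threshold else 0
                 for k in key_words)
```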
        <p>
          (h) The TF*IDF algorithm [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
<p>The inverse frequency of a word in the collection of documents, called the Inverse
Document Frequency (IDF), is calculated, as is the frequency of the word in a document, called the Term
Frequency (TF). The TF*IDF values help measure the similarity of documents to a query. This is a very
popular method. Its disadvantage is that it ignores the relative arrangement of the words
in the documents.</p>
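<p>A sketch of one common TF*IDF weighting variant (several variants exist; the exact formula a given system uses may differ):</p>

```python
import math

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict
    per document, with weight = TF * IDF."""
    n = len(docs)
    df = {}                       # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    result = []
    for doc in docs:
        tf = {}
        for w in doc:
            tf[w] = tf.get(w, 0) + 1
        result.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return result
```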
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The algorithm of fuzzy duplicate texts detection</title>
      <p>
        At present, there are a large number of approaches for finding similar strings and small texts [
        <xref ref-type="bibr" rid="ref21 ref22 ref23 ref24 ref25">21-25</xref>
        ].
For large texts (from 1000 words), the search for matches is difficult due to the large time
expenditure. Therefore, it is of interest to construct an algorithm that finds matches in large texts
in a short time.
      </p>
      <p>Let’s consider our algorithm of fuzzy duplicate detection.</p>
      <sec id="sec-3-1">
        <title>3.1. A character set conversion.</title>
<p>A checked text document can have different encodings; for the Russian language these are Windows-1251,
KOI8-R, UTF-8 and ASCII. Therefore, the checked text document is converted to the uniform character set
UTF-8.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Converting a text to the single long length string.</title>
<p>A text document is a set of strings, and the formatting of identical documents may differ greatly.
Therefore, it is necessary to simplify the representation of the text. For this purpose, we used a linear
representation of the text as a sequence of characters. In other words, the text appears as a single
string of great length.</p>
      </sec>
      <sec id="sec-3-2a">
        <title>3.3. Pre-processing of a text.</title>
<p>Pre-processing of the text is performed: the text of the document is replaced by a filtered copy. For this
purpose the following steps are performed.</p>
        <p>• Removal of HTML tags.</p>
<p>A text document can contain HTML tags. Because HTML tags are used for document
formatting, they do not affect its contents. Their presence would only interfere with
checking, so they must be removed from the document.</p>
        <p> Removal of punctuation marks and special characters.</p>
<p>The text contains punctuation marks and special characters, including tabs, new lines, etc. All these
characters are removed from the document.</p>
        <p> Case conversion.</p>
<p>In the text of a document, many words can be written with lowercase or uppercase first letters, consist
entirely of uppercase letters, or mix uppercase and lowercase letters. Therefore, all
characters are converted to lowercase.</p>
        <p> Processing of replacement characters.</p>
<p>Sometimes text documents contain replacement characters. For example, in Russian words
several letters of the Russian alphabet may be replaced by similar-looking Latin letters (such as “o”, “a”, “e”, “y”, “x”,
“c”, “p”, etc.). In this case, the replacement characters are processed and transformed back to their original
form.</p>
        <p> Removal of stop words.</p>
<p>Stop words are common words occurring in almost every document, such as "the", "of", "on",
"a", "at", "by", etc. These words do not help in the search for text duplicates but increase the
processing time, so they must be removed from the document.</p>
        <p> Filtration of the text (not informative words removal).</p>
<p>Filtering of the text removes the most frequent words, non-informative words, etc. Words
that contain special characters or digits, words of great length, etc. are also filtered out. This procedure
significantly reduces the amount of computation (the length of the checked text).</p>
        <p>• Stemming (processing of word endings).</p>
<p>Stemming processes word endings; in our case they are simply deleted. This avoids the effect
of such modifications of the text as changes between singular and plural forms, masculine and feminine gender,
present and past tense, etc.</p>
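<p>The pre-processing steps above can be sketched as a small pipeline. The stop-word list here is an illustrative subset, and the look-alike replacement and stemming steps are omitted for brevity:</p>

```python
import re

STOP_WORDS = {"the", "of", "on", "a", "at", "by"}   # illustrative subset

def preprocess(text: str) -> list:
    text = re.sub(r"<[^>]+>", " ", text)    # removal of HTML tags
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation and special characters
    text = text.lower()                     # case conversion
    return [w for w in text.split() if w not in STOP_WORDS]
```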
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Detection of fuzzy duplicates for a document.</title>
<p>Our algorithm is based on the method of shingles. The text is divided into sequences of words of fixed
length (5 words). For each sequence of words an MD5 code
(shingle) is computed. The algorithm works with these codes (figure 1).</p>
<p>The classical method of shingles uses all words of the text to create the sequences. We propose to
consider not the whole text of the document but its processed and filtered copy.</p>
<p>Figure 1. Example of shingle MD5 codes for the string “To be, or not to be: that is the question”.</p>
<p>The number of matching MD5 codes shows the measure of similarity of the documents. If all codes of two
documents match, the documents are full duplicates. If no codes match, the documents
are different. If only a part of the MD5 codes match, the documents are fuzzy duplicates.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. The system structure</title>
      <p>
        At Vladimir State University, the authors developed a system of fuzzy duplicate
detection. The system checks both sources available on the Internet and an internal database
(databases of articles, course works, examinations, etc.). The system generates a report with
colouring of duplicate parts of texts and viewing of the found duplicate sources. Let’s consider the system
structure [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] (figure 2).
      </p>
      </p>
<p>Text preparation includes removal of HTML tags, removal of punctuation marks and special
characters, case conversion, processing of replacement characters, removal of stop words, filtration of
the text, and stemming.</p>
<p>The system of fuzzy duplicates detection includes two modules. Each module operates
independently of the other.</p>
<p>The first module checks the internal database sources. These sources include a database of articles,
course works and examinations, diploma projects, textbooks and lecture courses. Sources are stored both in
full-text form and in the form of a specially organized search index. The search index is needed for
quick checking of text duplicates against the database sources: there is no need to look through all
the available texts for each check, which would be a fairly laborious process. All the information necessary for the search is
already included in the structured search index, which is generated from the pre-processed text.</p>
<p>The second module checks documents from the Internet. The text of the document is divided into
informative parts; the number of parts depends on the size of the document. Next, a search
engine is used to find sources containing these informative pieces. The search module uses
Yandex.XML, Yandex.com, Google.ru, Rambler.ru, Yahoo.com, Poisk.Mail.ru, Nigma.ru, etc.
The received documents are compared with the original document. To do this, the document format is
determined (html document, txt file, doc or rtf document, pdf file). From an html document all markup
tags are removed. The files *.doc, *.docx, *.rtf and *.pdf are converted to a plain text format with no
markup. In the next step, the documents are pre-processed and their similarity with the original
document is computed.</p>
<p>The main requirements for the system are the completeness and accuracy of duplicate detection.
We did not set the task of reducing the checking time.</p>
      <p>
        The algorithm of fuzzy duplicate texts detection was implemented in the system AVTOR.NET [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Practical use</title>
<p>In 2016 the Avtor.NET system was used to check the full texts of diploma projects for plagiarism in the
department of technosphere safety of Vladimir State University. During 10 days, more
than 70 diploma projects were checked, each from 60 to 120 pages long. For this purpose, it was
necessary to add a new module to reduce the load on the search engines. The Avtor.NET system
coped with the task: all diploma projects were checked, and detailed reports on
the presence of plagiarism were received.</p>
<p>To test the efficiency of the algorithm, we compiled four groups of tests:</p>
      <p>• Fuzzy duplicates of one document with reordering of the sentences (T1);</p>
      <p>• Fuzzy duplicates with changes of tenses and forms of words in the text (T2);</p>
      <p>• Fuzzy duplicates of one document with reordering of the sentences and addition of
original text between sentences (T3);</p>
      <p>• Fuzzy duplicates of several documents with reordering of the sentences (T4).</p>
<p>All the tests had a size of about 4000 characters and contained an average of 400 words. We used
collections of essays available on the Internet to create the tests. 10 tests were
composed for each group.</p>
<p>We compared the results of the systems Antiplagiat, AdvegoPlagiatus, AntiPlagiarism.NET,
Etxt.ru, Content-watch.ru, Text.Rucont.ru and Unicheck.com with our algorithm (system Avtor.NET). To
assess the quality of detection, we used Recall, showing what percentage of duplicated text was actually
detected. Precision of all the systems was high, tending to 1.</p>
      <p>All systems correctly handled test T1. The slightly worse result of Etxt.ru (Recall = 0.75) is due to
features of its comparison algorithm implementation. The systems AdvegoPlagiatus,
Text.Rucont.ru and Avtor.NET coped well with test T2. The worst results were shown by Unicheck.com (Recall =
0.32), AntiPlagiarism.NET (Recall = 0.58) and Antiplagiat (Recall = 0.63).</p>
      <table-wrap id="tbl1">
        <label>Table 1.</label>
        <caption>
          <p>Recall values of the compared systems on test T1.</p>
        </caption>
        <table>
          <thead>
            <tr><th>System</th><th>T1</th></tr>
          </thead>
          <tbody>
            <tr><td>Antiplagiat</td><td>0.98</td></tr>
            <tr><td>AdvegoPlagiatus</td><td>0.99</td></tr>
            <tr><td>AntiPlagiarism.NET</td><td>0.83</td></tr>
            <tr><td>Etxt.ru</td><td>0.75</td></tr>
            <tr><td>Content-watch.ru</td><td>0.90</td></tr>
            <tr><td>Text.Rucont.ru</td><td>0.94</td></tr>
            <tr><td>Unicheck.com</td><td>0.95</td></tr>
            <tr><td>Avtor.NET</td><td>0.99</td></tr>
          </tbody>
        </table>
      </table-wrap>
<p>Test T3 was handled well by Text.Rucont.ru and Avtor.NET (Recall = 0.94), Antiplagiat (Recall =
0.89) and Content-watch.ru (0.87).</p>
<p>Only Etxt.ru passed test T4 badly (Recall = 0.49). All other systems showed very good results.</p>
<p>As we can see, the Avtor.NET system coped well with all four tests. The algorithm copes well
with the replacement of endings, permutations of pieces of text, compilation from several sources and
alteration of texts with the insertion of new words.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>The Avtor.NET system correctly handled all texts and showed results that are not inferior to, and sometimes
exceed, the results of existing systems. Thus, an algorithm of fuzzy duplicate detection was created.
The use of text filtering, stemming and replacement-character processing allows the algorithm to find duplicates
even in slightly modified texts.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Sharapova</surname>
            <given-names>E V</given-names>
          </string-name>
          <year>2014</year>
          <article-title>Analysis of methods and systems for fuzzy duplicate detection Proc</article-title>
          .
          <article-title>of 14 International multidisciplinary scientific Geoconference SGEM2014</article-title>
          . Informatics,
          <source>Geoinformatics and Remote Sensing</source>
          <volume>1</volume>
          <fpage>27</fpage>
          -
          <lpage>33</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Hamming</surname>
            <given-names>R W</given-names>
          </string-name>
          <year>1950</year>
          <article-title>Error detecting</article-title>
          and
          <source>error correcting codes Bell System Technical Journal</source>
          <volume>29</volume>
          <fpage>147</fpage>
          -
          <lpage>160</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Levenshtein</surname>
            <given-names>V I</given-names>
          </string-name>
          <year>1966</year>
          <article-title>Binary codes capable of correcting deletions, insertions</article-title>
          ,
          <source>and reversals Soviet Physics Doklady</source>
          <volume>10</volume>
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Damerau</surname>
            <given-names>F J</given-names>
          </string-name>
          <year>1964</year>
          <article-title>A technique for computer detection and correction of spelling errors</article-title>
          <source>Communications of the ACM</source>
          <volume>7</volume>
          <fpage>171</fpage>
          -
          <lpage>176</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Wagner</surname>
            <given-names>R A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fischer</surname>
            <given-names>M J</given-names>
          </string-name>
          <year>1974</year>
          <article-title>The string-to-string correction problem</article-title>
          <source>Journal of the ACM</source>
          <volume>21</volume>
          <fpage>168</fpage>
          -
          <lpage>173</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Gusfield</surname>
            <given-names>D</given-names>
          </string-name>
          <year>2003</year>
          <source>Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology</source>
          (Cambridge University Press)
          <volume>654</volume>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Masek</surname>
            <given-names>W J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Paterson</surname>
            <given-names>M S</given-names>
          </string-name>
          <year>1980</year>
          <article-title>A faster algorithm for computing string-edit distances</article-title>
          <source>Journal of Computer and Systems Sciences</source>
          <volume>20</volume>
          <fpage>18</fpage>
          -
          <lpage>31</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Masek</surname>
            <given-names>W J</given-names>
          </string-name>
          and
          <string-name>
            <surname>Paterson</surname>
            <given-names>M S</given-names>
          </string-name>
          <year>1983</year>
          <article-title>How to compute string-edit distances quickly Time warps, string edits, and macromolecules: the theory and practice of sequence comparison (Addison Wesley</article-title>
          , Reading, MA)
          <fpage>337</fpage>
          -
          <lpage>349</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hirschberg</surname>
            <given-names>D S</given-names>
          </string-name>
          <year>1975</year>
          <article-title>A linear space algorithm for computing maximal common subsequences</article-title>
          <source>Communications of the ACM</source>
          <volume>18</volume>
          <fpage>341</fpage>
          -
          <lpage>343</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Hunt</surname>
            <given-names>J W</given-names>
          </string-name>
          and
          <string-name>
            <surname>Szymanski</surname>
            <given-names>T G</given-names>
          </string-name>
          <year>1977</year>
          <article-title>A fast algorithm for computing longest common subsequences</article-title>
          <source>Communications of the ACM</source>
          <volume>20</volume>
          <fpage>350</fpage>
          -
          <lpage>353</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>McCreight</surname>
            <given-names>E M</given-names>
          </string-name>
          <year>1976</year>
          <article-title>A space-economical suffix tree construction algorithms</article-title>
          <source>Journal of the ACM</source>
          <volume>23</volume>
          <fpage>262</fpage>
          -
          <lpage>272</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Boytsov</surname>
            <given-names>L M</given-names>
          </string-name>
          <year>2004</year>
          <article-title>Classification and experimental study of the modern algorithms for fuzzy dictionary search</article-title>
          <source>Proc. of 6-th Russian Scientific Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections 10</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Knuth</surname>
            <given-names>D</given-names>
          </string-name>
          <year>1997</year>
          <source>The Art of Computer Programming</source>
          (Addison-Wesley)
          <fpage>396</fpage>
          -
          <lpage>408</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Baeza-Yates</surname>
            <given-names>R</given-names>
          </string-name>
          and
          <string-name>
            <surname>Navarro</surname>
            <given-names>G</given-names>
          </string-name>
          <year>1996</year>
          <article-title>A faster algorithm for approximate string matching</article-title>
          <source>Proc. of Combinatorial Pattern Matching</source>
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Nagao</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mori</surname>
            <given-names>S</given-names>
          </string-name>
          <year>1994</year>
          <article-title>A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese</article-title>
          <source>Proc. of the 15th International Conference on Computational Linguistics</source>
          <fpage>611</fpage>
          -
          <lpage>615</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Broder</surname>
            <given-names>A</given-names>
          </string-name>
          <year>1998</year>
          <article-title>On the resemblance and containment of documents</article-title>
          <source>Compression and Complexity of Sequences SEQUENCES'97</source>
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Fetterly</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manasse</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Najork</surname>
            <given-names>M</given-names>
          </string-name>
          <year>2003</year>
          <article-title>On the Evolution of Clusters of Near-Duplicate Web Pages</article-title>
          <source>Proc. of the First Conference on Latin American Web Congress</source>
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Kolcz</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdhury</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Alspector</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2004</year>
          <article-title>Improved Robustness of Signature-Based Near-Replica Detection via Lexicon Randomization</article-title>
          <source>Proc. of KDD</source>
          <fpage>605</fpage>
          -
          <lpage>610</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Ilyinsky</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzmin</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Melkov</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Segalovich</surname>
            <given-names>I</given-names>
          </string-name>
          <year>2002</year>
          <article-title>An efficient method to detect duplicates of Web documents with the use of inverted index</article-title>
          <source>Proc. of WWW Conference</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Salton</surname>
            <given-names>G</given-names>
          </string-name>
          and
          <string-name>
            <surname>McGill</surname>
            <given-names>M J</given-names>
          </string-name>
          <year>1983</year>
          <source>Introduction to modern information retrieval (New York: McGraw-Hill)</source>
          448
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Mikhaylov</surname>
            <given-names>D V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlov</surname>
            <given-names>A P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Emelyanov</surname>
            <given-names>G M</given-names>
          </string-name>
          <year>2015</year>
          <article-title>An approach based on TF-IDF metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets</article-title>
          <source>Computer Optics</source>
          <volume>39</volume>
          (
          <issue>3</issue>
          )
          <fpage>429</fpage>
          -
          <lpage>438</lpage>
          DOI: 10.18287/0134-2452-2015-39-3-429-438
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Mikhaylov</surname>
            <given-names>D V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlov</surname>
            <given-names>A P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Emelyanov</surname>
            <given-names>G M</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Extraction of knowledge and relevant linguistic means with efficiency estimation for the formation of subject-oriented text sets</article-title>
          <source>Computer Optics</source>
          <volume>40</volume>
          (
          <issue>4</issue>
          )
          <fpage>572</fpage>
          -
          <lpage>582</lpage>
          DOI: 10.18287/2412-6179-2016-40-4-572-582
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Evdokimova</surname>
            <given-names>N I</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kuznetsov</surname>
            <given-names>A V</given-names>
          </string-name>
          <year>2017</year>
          <article-title>Local patterns in the copy-move detection problem solution</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>
          (
          <issue>1</issue>
          )
          <fpage>79</fpage>
          -
          <lpage>87</lpage>
          DOI: 10.18287/2412-6179-2017-41-1-79-87
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Mikhaylov</surname>
            <given-names>D V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlov</surname>
            <given-names>A P</given-names>
          </string-name>
          and
          <string-name>
            <surname>Emelyanov</surname>
            <given-names>G M</given-names>
          </string-name>
          <year>2017</year>
          <article-title>An approach based on analysis of n-grams on links of words to extract the knowledge and relevant linguistic means on subject-oriented text sets</article-title>
          <source>Computer Optics</source>
          <volume>41</volume>
          (
          <issue>3</issue>
          )
          <fpage>461</fpage>
          -
          <lpage>471</lpage>
          DOI: 10.18287/2412-6179-2017-41-3-461-471
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Sharapova</surname>
            <given-names>E V</given-names>
          </string-name>
          <year>2014</year>
          <article-title>One way to fuzzy duplicates detection</article-title>
          <source>Proc. of 14th International Multidisciplinary Scientific GeoConference Informatics, Geoinformatics and Remote Sensing</source>
          <volume>1</volume>
          <fpage>273</fpage>
          -
          <lpage>277</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Zelenkov</surname>
            <given-names>Y G</given-names>
          </string-name>
          and
          <string-name>
            <surname>Segalovich</surname>
            <given-names>I V</given-names>
          </string-name>
          <article-title>Comparative analysis of methods for fuzzy duplicate detection for Web-documents</article-title>
          <source>Proc. of 9-th Russian Scientific Conference Digital Libraries: Advanced Methods and Technologies, Digital Collections 9</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Sharapova</surname>
            <given-names>E V</given-names>
          </string-name>
          <year>2014</year>
          <article-title>System of fuzzy duplicates detection</article-title>
          <source>Applied Mechanics and Materials</source>
          <volume>490-491</volume>
          <fpage>1503</fpage>
          -
          <lpage>1507</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>