<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Engineering a Tool to Detect Automatically Generated Papers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nguyen Minh Tien</string-name>
          <email>Minh-tien.nguyen@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cyril Labbé</string-name>
          <email>Cyril.labbe@imag.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Univ. Grenoble Alpes, LIG</institution>
          ,
          <addr-line>F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>54</fpage>
      <lpage>62</lpage>
      <abstract>
        <p>In the last decade, a number of nonsensical automatically generated scientific papers have been published, most of them produced using probabilistic context-free grammar (PCFG) generators. Such papers may also appear in scientific social networks or in open archives and thus bias the computation of metrics. This shows that there is a need for an automatic detection process to discover and remove such nonsense papers. Here, we present and compare different methods aiming at automatically classifying generated papers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The field of Natural Language Generation (NLG), a sub-field of Natural Language
Processing (NLP), has flourished. The data-to-text approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has been adopted for many useful real-life applications, such as weather forecasting [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], review summarization [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], or medical data summarization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, NLG is also used in a different way, as presented in Section 2.1, while
Section 2.2 presents some of the existing detection methods.
      </p>
      <p>
        In this paper, we are interested in detecting fake academic papers that are automatically
created using a Probabilistic Context Free Grammar (PCFG). Although these kinds of
texts are fairly easy for a human reader to detect, there is a recent need to automatically
detect such texts. This need has been highlighted by the Ike Antkare1 experiment [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and other studies [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Detection methods and tools are useful for open archives [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and
surprisingly also important for high profile publishers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Thus, the aim of this paper is to compare the performance of SciDetect2 – an open
source program – with other detection techniques.</p>
      <p>Section 2 gives a short description of fake paper generators based on PCFG and
also provides an overview of different existing detection methods. Section 3 details
detection approaches based on distance/similarity measurement. Section 4 presents a
tuned classification process used by the SciDetect tool. Section 5 shows comparison
results obtained by the different methods for fake paper detection. Section 6 concludes
the paper and makes proposals for future work.</p>
      <p>[Figure 1: example of PCFG sentence templates, e.g. "Many SCI PEOPLE would agree that, had it not been for SCI GENERIC NOUN ..." and "In recent years, much research has been devoted to the SCI ACT ...".]</p>
      <p>
The seminal generator SCIgen3 was the first realization of a family of scientifically oriented
text generators: SCIgen-Physic4 focuses on physics, Mathgen5 deals with mathematics,
and the Automatic SBIR Proposal Generator6 (Propgen in the following) focuses on
grant proposal generation. These four generators were originally developed as hoaxes
whose aim was to expose “bogus” conferences or meetings by submitting meaningless,
automatically generated papers.</p>
      <p>
        At a quick glance, these types of papers appear legitimate, with a coherent
structure as well as graphs, tables, and so on. Such papers might mislead naive
readers or an inexperienced public. They are created using a PCFG – a set of rules for the
arrangement of the whole paper as well as for individual sections and sentences (see
Figure 1). The scope of the generated texts depends on the generator but they are
typically quite limited when compared to a real human written text in both structure and
vocabulary [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Some methods have been developed to automatically identify SCIgen papers. For
example, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] checks whether references are proper references: a paper with a large
proportion of unidentified references is suspected of being a SCIgen paper. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
uses an ad-hoc similarity measure in which the reference section plays a major role
whereas [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is based on an observed compression factor and a classifier. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] in line
with [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposes to measure the structural distance between texts. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] proposes a
comparison of topological properties between natural and generated texts, and [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] studies
the effectiveness of different measures to detect fake scientific papers. Our own study
goes further along that track by including untested measures, such as the ones used by
ArXiv and Springer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Distance and Similarity Measurements</title>
      <p>In this paper, we are interested in measuring the similarity between documents as a way
to identify specific ones as being automatically generated. Thus, we investigated four
different measures: Kullback-Leibler divergence, Euclidean distance, cosine similarity
and textual distance.</p>
      <p>3 http://pdos.csail.mit.edu/scigen/
4 https://bitbucket.org/birkenfeld/scigen-physics
5 http://thatsmathematics.com/mathgen/
6 http://www.nadovich.com/chris/randprop/</p>
      <p>In the following, for a text A of length N_A (number of tokens), let F_A(w) denote
the absolute frequency of a word w in A (the number of times word w appears in A)
and P_A(w) = F_A(w)/N_A the relative frequency of w in A.</p>
      <p>Kullback-Leibler divergence: this method measures the difference between two
distributions, typically one under test and a reference one. It can thus be used to check
the frequency distribution observed in a text against the frequency distribution observed
in generated text. With a text under test B and a set of known generated texts A, the
(non-symmetric) Kullback-Leibler divergence of B from A is computed as follows:
D_KL(A; B) = Σ_{i ∈ S_w} P_A(i) log( P_A(i) / P_B(i) )</p>
      <p>
        This approach (with Sw a set of stop words found in A) seems to be currently used
by ArXiv. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] shows principal-component analysis plots (similar to Figure 2) where
computer-generated articles are arranged in tight clusters well separated from genuine
articles.
      </p>
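      <p>As a minimal sketch of this divergence (not the ArXiv or SciDetect implementation; the whitespace tokenisation and the eps smoothing constant are illustrative assumptions):</p>

```python
import math
from collections import Counter

def kl_divergence(reference_text, test_text, stop_words, eps=1e-9):
    """D_KL(A; B) over a stop-word list S_w: sum of P_A(i) * log(P_A(i) / P_B(i)).

    reference_text plays the role of the generated corpus A, test_text the text
    under test B. The eps term is an illustrative smoothing assumption that
    avoids division by zero and log(0) for unseen stop words."""
    ref = Counter(w for w in reference_text.lower().split() if w in stop_words)
    tst = Counter(w for w in test_text.lower().split() if w in stop_words)
    n_ref = sum(ref.values()) or 1
    n_tst = sum(tst.values()) or 1
    return sum((ref[w] / n_ref + eps) *
               math.log((ref[w] / n_ref + eps) / (tst[w] / n_tst + eps))
               for w in stop_words)
```

A text whose stop-word profile matches the generated-corpus profile yields a divergence near zero; a diverging profile yields a larger value.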
      <p>Euclidean Distance: each document can be considered as a vector of the absolute
frequencies of all the words that appear in it. Hence, the distance between two documents A
and B is calculated as:
d_E(A;B) = sqrt( Σ_{w ∈ A ∪ B} (F_A(w) − F_B(w))² )</p>
      <p>While it is simple to compute, it is often regarded as not well suited for computing
similarities between documents.</p>
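      <p>A minimal sketch of this distance, computed over the union of both vocabularies (whitespace tokenisation is an illustrative assumption):</p>

```python
import math
from collections import Counter

def euclidean_distance(text_a, text_b):
    """d_E(A;B): Euclidean distance between the absolute word-frequency
    vectors of two texts, summed over all words in A or B."""
    fa = Counter(text_a.lower().split())
    fb = Counter(text_b.lower().split())
    return math.sqrt(sum((fa[w] - fb[w]) ** 2 for w in set(fa) | set(fb)))
```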
      <p>
        Textual Distance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]: this method computes the difference in the proportions of word tokens between
two texts. The distance between two texts A and B, where N_A &lt; N_B, is:
d(A;B) = ( Σ_{w ∈ A ∪ B} | F_A(w) − (N_A/N_B) F_B(w) | ) / (2 N_A)
where d(A;B) = 0 means A and B share the same word distribution and d(A;B) = 1
means there is no common word in A and B.
      </p>
      <p>Figure 2 shows that using textual distance creates a clear separation in the distance
to the nearest neighbour between 400 generated papers and genuine ones: the fake papers
form a compact group for each type of generator, clearly separated from genuine texts
(SCIgen and Physgen were merged because of their close relation). Thus, in the next
section, we present our SciDetect system, which uses textual distance and
nearest-neighbour classification with custom thresholds.</p>
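      <p>This distance can be sketched as follows (an illustrative implementation; tokenisation by whitespace is an assumption, and the longer text's frequencies are rescaled by N_A/N_B before comparison):</p>

```python
from collections import Counter

def textual_distance(text_a, text_b):
    """Inter-textual distance in [0, 1]: 0 means identical word
    distributions, 1 means no word in common."""
    fa = Counter(text_a.lower().split())
    fb = Counter(text_b.lower().split())
    na, nb = sum(fa.values()), sum(fb.values())
    if na > nb:  # the definition assumes N_A <= N_B
        fa, fb, na, nb = fb, fa, nb, na
    scale = na / nb  # rescale B's frequencies to A's length
    diff = sum(abs(fa[w] - scale * fb[w]) for w in set(fa) | set(fb))
    return diff / (2 * na)
```

Two texts with the same word distribution score 0 even at different lengths, while fully disjoint vocabularies score 1.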
    </sec>
    <sec id="sec-3">
      <title>4 A Tool to Detect Automatically Generated Papers</title>
      <p>In this section we present our SciDetect system, which is based on inter-textual distance
using all the words and nearest neighbour classification. To avoid mis-classifications
caused by text length, texts shorter than 10000 characters were ignored and texts longer
than 30000 characters were split into smaller parts. To determine the genuineness of
a text, we used different thresholds for each type of generator. We have performed a
number of tests in order to set these thresholds.</p>
      <p>For each generator (SCIgen, Physgen, Mathgen and Propgen), a set of 400 texts
was used as a test corpus (a total of 1600 texts). For each text, the distance to its nearest
neighbour in the sample sets, which were composed of an extra 100 texts per generator
(400 additional texts), was computed. The nearest neighbour was always of the same
nature as the tested text; columns 1, 2, 3, and 4 of Table 1 show statistical information
about the observed distances.</p>
      <p>In addition, to determine an upper threshold for genuine texts, a set of 8200
genuine papers from various fields was used. The nearest neighbour for each genuine
text was computed using the same sample sets.</p>
      <p>The first two rows of Table 1 show that, for a genuine paper, the minimal distance
to the nearest neighbour in the sample set (0.52) is always greater than the maximal
distance to the nearest neighbour of a fake paper (0.40).</p>
      <p>By observing the results, we concluded that there is always a close grouping of
the generated texts, separated from the group of real texts by a considerable gap. It is
therefore safe to classify texts based on thresholds. Thus, two thresholds were set for
each generator: a lower threshold for generated papers, based on the second row of
Table 1, and an upper threshold for genuine papers (varying from 0.52 to 0.56 depending
on the generator). Hence, a paper can be identified as possibly generated in two different
ways. First, if the distance is lower than the specific threshold for a generated paper, it
is considered a confirmed case of a generated paper. Second, if the distance is between
the thresholds for generated and genuine papers, it is considered a suspicious case.</p>
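      <p>The two-threshold decision can be sketched as below; the numeric defaults are illustrative stand-ins for the per-generator values derived from Table 1:</p>

```python
def classify(nearest_distance, generated_threshold=0.40, genuine_threshold=0.52):
    """Classify a text from the textual distance to its nearest neighbour in
    the generated sample set. Threshold values are illustrative; SciDetect
    tunes a pair of thresholds for each generator."""
    if nearest_distance <= generated_threshold:
        return "confirmed generated"
    if nearest_distance < genuine_threshold:
        return "suspicious"
    return "genuine"
```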
    </sec>
    <sec id="sec-4">
      <title>5 Comparative Evaluation Between Different Methods</title>
      <p>To thoroughly evaluate SciDetect against other methods, we conducted a
comparative test using different known methods.</p>
      <sec id="sec-4-1">
        <title>5.1 Test Candidates</title>
        <p>Pattern Matching: Since automatically generated text has a very limited base of
sentences, one might expect that simply applying a pattern-matching technique – scanning
a given document and reporting a specific score whenever a familiar pattern (a string
of words) is encountered – would work. In this research, we used a pattern-matching tool
that was developed and used internally at Springer, which computes the score as follows:
each detected phrase (string of tokens) that matches a particular pattern scores 10; if
the phrase contains five to nine matching words, the score is 50, or 100 for phrases that
have more than nine matching words. The final score is then compared with a threshold
to determine whether the paper is automatically generated. If the score is less than
500, the paper is considered genuine; a score between 500 and 1000 is suspicious (it
may be genuine or fake); and if the score is more than 1000, the paper is considered
fake.</p>
        <p>This method might not be very reliable, since the patterns can easily be modified.
In addition, it is difficult to maintain and update the checker for a new type of generator
for which the grammar is not available. Such an approach is also quite sensitive to the
length of the text: the longer the text, the higher the chance that some specific pattern
will appear.</p>
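        <p>The scoring scheme described above can be sketched as follows; this is an illustrative reimplementation, not Springer's internal tool, and the pattern list is a hypothetical input:</p>

```python
def pattern_score(text, patterns):
    """Score a document against known generator phrases: 10 points per match
    for a short pattern, 50 for a pattern of five to nine words, and 100 for
    a pattern of more than nine words."""
    low = text.lower()
    score = 0
    for phrase in patterns:
        hits = low.count(phrase.lower())
        words = len(phrase.split())
        if words > 9:
            score += 100 * hits
        elif words >= 5:
            score += 50 * hits
        else:
            score += 10 * hits
    return score

def verdict(score):
    """Map the final score to the three outcome classes."""
    if score < 500:
        return "genuine"
    if score <= 1000:
        return "suspicious"
    return "fake"
```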
        <p>
          Kullback-Leibler Divergence: As presented before, this method seems to be currently
used by ArXiv. We implemented our own system that uses a list Sw of 571 stop-words
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] to classify texts. A profile of the average stop-word frequency distribution for
each generator was created using the same 400 generated texts as in the sample corpora
of SciDetect. Two thresholds per generator were also established in the same manner
as in Section 4: a generated threshold, the maximum KL-divergence between a profile
and a generated text from the test corpus; and a written threshold, the minimum
KL-divergence between a profile and genuine written texts.
        </p>
        <p>SciDetect: We would also like to verify the usefulness of our SciDetect system as
presented in Section 4.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Test Corpora</title>
        <p>We used three different corpora to conduct the test:
– Corpus X: 100 texts from known generators (25 for each type of generator) without
any modification.</p>
        <p>– Corpus Y: 100 generated texts (25 from each generator) that have been modified
by randomly changing a word every two to nine words with a word taken from a
genuine research paper. The aim of this corpus is to test the robustness of these
methods not only against pure generated texts but also against modified versions which
have a somewhat different word distribution compared to the samples.</p>
        <p>– Corpus Z: 10,000 real texts with lengths ranging from two pages to more
than 100 pages.</p>
        <p>These experiments aim at determining the performance of the different methods for
detecting generated papers. The results are shown in Table 2, where true negative
and true positive are respectively when a genuine paper or a generated paper is correctly
identified, and vice versa for false negative and false positive. [Table 2: true positive
(confirm/suspect), false positive (confirm/suspect), true negative, and false negative
counts for each method (Pattern Matching, Kullback-Leibler Divergence, SciDetect)
and each corpus.]</p>
        <p>Close study of these results highlights several interesting aspects. Considering the
current state of generators, current classifiers all work relatively well (all achieved a
perfect precision rate). Difficult cases (Corpus Y) were marked as suspicious, thus
requiring further investigation. In particular, SciDetect proved to be the most reliable
method: all tests passed at 100%. Furthermore, despite the fact that pattern matching
was designed to match only SCIgen patterns, it was able to recognize three papers from
SCIgen-Physics as suspected SCIgen; however, when applied to Corpus Y, one modified
SCIgen paper was mistakenly listed as genuine. One false positive of the pattern
checker on Corpus Z was caused by a large file of more than 110 pages, which
triggered an out-of-memory error.</p>
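        <p>The Corpus Y modification can be sketched as below; this is an illustrative reconstruction of the described procedure, and the function and parameter names are ours:</p>

```python
import random

def make_corpus_y_text(generated_tokens, genuine_tokens, seed=0):
    """Replace a word every two to nine words with a word drawn at random
    from a genuine research paper, as in the Corpus Y construction."""
    rng = random.Random(seed)
    out = list(generated_tokens)
    i = rng.randint(2, 9)
    while i < len(out):
        out[i] = rng.choice(genuine_tokens)
        i += rng.randint(2, 9)
    return out
```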
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Conclusion</title>
      <p>There is a need for automatic detection of computer-generated papers in the scientific
literature, and there are several ways to accomplish this task. Among them, textual
distance was demonstrated to provide the best results, and this method was adopted in
SciDetect. Furthermore, SciDetect was tested against pattern matching and Kullback-Leibler
divergence between stop-words, and proved to be the most reliable method for
classification.</p>
      <p>
        However, against other text-generation techniques such as Markov chains, SciDetect
and the other current methods are impractical, since such texts have a word distribution
similar to that of a human-written paper and no fixed patterns. This calls for more in-depth
research [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] such as checking the meaning of words [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], citation context[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] or
evaluating sentence construction as well as the styles of generated texts [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was funded by Springer Nature. We would like to thank our colleagues
from the PCM department of Springer Nature, who provided valuable insights, expertise,
and test data that greatly assisted our research; special thanks to Jeff Iezzi for his
continuous support throughout the process, and to the reviewers who supplied valuable
criticism of our work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Ike Antkare one of the great stars in the scientific firmament</article-title>
          .
          <source>ISSI Newsletter</source>
          <volume>6</volume>
          (
          <issue>2</issue>
          ) (
          <year>2010</year>
          )
          <fpage>48</fpage>
          -
          <lpage>52</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gipp</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Academic search engine spam and Google Scholar's resilience against it</article-title>
          .
          <source>Journal of Electronic Publishing</source>
          (December
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ginsparg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : Automated screening:
          <article-title>ArXiv screens spot fake papers</article-title>
          .
          <source>Nature</source>
          <volume>508</volume>
          (
          <issue>7494</issue>
          ) (
          <year>March 2014</year>
          )
          <fpage>44</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Duplicate and fake publications in the scientific literature: How many scigen papers in computer science</article-title>
          ?
          <source>Scientometrics</source>
          <volume>94</volume>
          (
          <issue>1</issue>
          ) (
          <year>January 2013</year>
          )
          <fpage>379</fpage>
          -
          <lpage>396</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roncancio</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bras</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A personal storytelling about your favorite data</article-title>
          .
          <source>In: Proc. ENLG</source>
          . (
          <year>2015</year>
          )
          <fpage>166</fpage>
          -
          <lpage>174</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Reiter</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sripada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davy</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Choosing words in computer-generated weather forecasts</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>167</volume>
          (
          <year>2005</year>
          )
          <fpage>137</fpage>
          -
          <lpage>169</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tien</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Portet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Hypertext Summarization for Hotel Review</article-title>
          . hal-
          <volume>01153598</volume>
          (
          <year>March 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Portet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reiter</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hunter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sripada</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sykes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatic generation of textual summaries from neonatal intensive care data</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>173</volume>
          (
          <year>2009</year>
          )
          <fpage>789</fpage>
          -
          <lpage>816</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Portet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Detection of computer generated papers in scientific literature</article-title>
          .
          (March
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An effective method to identify machine automatically generated paper</article-title>
          .
          <source>In: Knowledge Engineering and Software Engineering</source>
          . (
          <year>2009</year>
          )
          <fpage>101</fpage>
          -
          <lpage>102</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lavoie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnamoorthy</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Algorithmic detection of computer generated text</article-title>
          .
          <source>arXiv preprint arXiv:1008.0706</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dalkilic</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>W.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Costello</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radivojac</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Using compression to identify classes of inauthentic texts</article-title>
          .
          <source>In: Proc. of the 2006 SIAM Conf. on Data Mining</source>
          . (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Fahrenberg</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Biondi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corre</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jégourel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kongshøj</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legay</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Measuring global similarity between texts</article-title>
          . In: Second International Conference, SLSP. (
          <year>2014</year>
          )
          <fpage>220</fpage>
          -
          <lpage>232</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Amancio</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          :
          <article-title>Comparing the topological properties of real and artificially generated scientific manuscripts</article-title>
          .
          <source>Scientometrics</source>
          <volume>105</volume>
          (
          <issue>3</issue>
          ) (
          <year>December 2015</year>
          )
          <fpage>1763</fpage>
          -
          <lpage>1779</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Williams</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>On the use of similarity search to detect fake scientific papers</article-title>
          .
          <source>In: Similarity Search and Applications - 8th International Conference</source>
          , SISAP
          <year>2015</year>
          .
          <fpage>332</fpage>
          -
          <lpage>338</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Singhal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modern Information Retrieval: A Brief Overview</article-title>
          .
          <source>Bulletin of the IEEE Computer Society Technical Committee on Data Engineering</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ) (
          <year>2001</year>
          )
          <fpage>35</fpage>
          -
          <lpage>42</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Feinerer</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hornik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Text mining infrastructure in R</article-title>
          .
          <source>Journal of Statistical Software</source>
          <volume>25</volume>
          (
          <issue>5</issue>
          ) (
          <year>2008</year>
          )
          <fpage>1</fpage>
          -
          <lpage>54</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Labbé</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>How to measure the meanings of words? amour in corneille's work</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>39</volume>
          (
          <issue>4</issue>
          ) (
          <year>2005</year>
          )
          <fpage>335</fpage>
          -
          <lpage>351</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Small</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Interpreting maps of science using citation context sentiments: A preliminary investigation</article-title>
          .
          <source>Scientometrics</source>
          <volume>87</volume>
          (
          <issue>2</issue>
          ) (May
          <year>2011</year>
          )
          <fpage>373</fpage>
          -
          <lpage>388</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Kollmer</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pöschel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gallas</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Are physicists afraid of mathematics?</article-title>
          <source>New Journal of Physics</source>
          <volume>17</volume>
          (
          <issue>1</issue>
          ) (
          <year>2015</year>
          )
          <fpage>013036</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>