Survey on Plagiarism Challenges
Volodymyr Taranukha
International Research and Training Center for Information Technologies and Systems, 40, Acad. Glushkova av., Kyiv, 03187, Ukraine

Abstract
This paper describes the current state of plagiarism detection and the challenges that plagiarism poses to modern society. Several significant aspects are highlighted: technological aspects driven by recent developments in modern NLP tools, social aspects caused by the ongoing COVID-19 pandemic, the development of new content similarity detection methods, etc. All of them add new dimensions to the challenges of plagiarism.

Keywords
Content similarity detection, plagiarism detection, text similarity, machine learning

1. Introduction
Modern society is more and more entrapped in the global communication environment. This ranges from TV to social networks, from science to advertisement, and from entertainment to propaganda. A significant part of these data sources deals with text in a variety of forms. This has resulted in a surge of text generation techniques on top of old-fashioned rewriting, appropriation, and plagiarism. Different areas of communication suffer differently from this malaise. Alongside malicious text generation, which plagues social media, plagiarism remains one of the worst afflictions of the infosphere. Text generation implemented as a component in bots creates a false image of grassroots support while hiding actual astroturfing, and often creates an echo-chamber effect. This leads to politicians making wrong decisions with devastating effects. Plagiarism, especially machine-assisted plagiarism, undermines the fundamentals of modern science both in scientific research and in university study, since it is much easier to turn in autogenerated text instead of the results of actual study. There are some commonalities between content similarity detection, text rewriting, and text generation.
Such commonalities lie in the aspect of text similarity and the mathematical tools (measures) used to quantify it. Text generation does not necessarily connect directly to such measures, but there are links, at least in usage. It is convenient to use a rewriting tool such as the one created by Grammarly [1] in combination with a tool powered by GPT-3 [2] to generate some elements of the text from whole cloth. Additional coherence metrics tools [3] can be applied on top of this to make the whole text more appealing. This raises the issue of fair use on one side and plagiarism detection on the other [4]. Search Engine Optimization (SEO) content creators are free to use text generation tools, since technically it is not plagiarism. However, this creates a significant amount of online accessible texts with similar stylistics, vocabulary, and so on.

2. Background of current trends in plagiarism detection
As noted, content similarity detection and plagiarism detection are among the areas where development has never stopped. According to Google Scholar [5], the number of indexed articles on plagiarism detection has been rising steadily for the last 5 years at a rate of over 3,000 a year (2022 shows fewer, but the year has not ended yet).

Information technology and implementation (IT&I-2022), November 30 - December 2, 2021, Kyiv, Ukraine
EMAIL: volodymyr.taranukha@gmail.com (A. 1)
ORCID: 0000-0002-9888-4144 (A. 1)
©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

This avalanche of publications follows several trends, social and educational among them. The COVID-19 pandemic forced educational institutions to shift online, with a significant part of the educational process turning paperless. This in turn caused a surge of plagiarized assignments submitted to teachers [6].
This is even more pronounced for Ukraine [7]: on top of the need to reform education and the socio-economic effects of COVID-19, since February 2022 there has been the Russian invasion, which forced many students to relocate away from their schools and universities. This situation in turn kept the development and integration of plagiarism detection tools proceeding at a hastened pace.

2.1. Technical aspects of plagiarism
In order to analyze plagiarism one needs to establish a framework: what plagiarism is, how it is dealt with (including data sources and methods), what new technological developments look like, and how it is connected to other issues which can either help or muddy the waters further.

2.1.1. Defining plagiarism
One of the main issues in academic circles is plagiarism. It has been studied for many years in an effort to reduce plagiarism, preserve the standard of writing, and safeguard authors' rights. A violation of an author's or writers' copyright is referred to as plagiarism. It refers to using someone else's ideas or works without giving them due credit. There are several definitions of and typologies for plagiarism. Dictionaries define plagiarism in a quite straightforward yet insufficient manner. The Merriam-Webster dictionary defines plagiarism as "to steal and pass off (the ideas or words of another) as one's own" and "to commit literary theft" [8]. The Cambridge dictionary defines it as "the process or practice of using another person's ideas or work and pretending that it is your own" [9]. However, neither says anything about the means to discern a plagiarized text from a non-plagiarized one. There is no final agreement on what plagiarism is, yet the categorization of plagiarism is quite developed, and most researchers agree on at least part of the basics. The following types of plagiarism are commonly recognized with some variation; nevertheless, the concepts behind the names are mostly the same [10-12].
1. Copy and paste
2. Mix and paste
3.
Sacrifice of an unimportant part to make the text look different
4. Structural rewriting plagiarism
5. Translation
6. Self-plagiarism
Some authors offer their own classifications, such as [13]: secondary source, invalid source, duplication, paraphrasing, repetitive research, replication, misleading attribution, unethical collaboration, verbatim plagiarism, and complete plagiarism. However, such classifications are often either very specialized or misleading. Also, there are some research venues that try to tackle a specific kind of plagiarism, for example teacher plagiarism [14], yet there is little success there, since sometimes one and the same term refers to different issues, not to mention the basic question: what is there in teacher plagiarism (or any other kind of plagiarism specific to the author of the plagiarized text) that makes it either worth researching independently or at least different enough to warrant its own category?

2.1.2. Legal aspects and their influence on the field of research
From a legal standpoint, there are some documents describing what to do and how to treat a text containing plagiarism. For example, in Ukraine there is a hard 30% limit on non-original text in scientific works, citations included. Direct plagiarism without pointing out the source is entirely prohibited. Any plagiarized (or suspected of plagiarism) text can be grounds to strike down the work entirely at any level, from a student's homework to a PhD thesis [15]. In India, a notably different approach is used. The UGC [16] stated in its draft policy that an academic misconduct panel should be constituted by higher educational institutions to investigate cases of plagiarism and submit a report. The UGC has announced the Indian draft policy on plagiarism for academicians and researchers, with levels of penalty. There is no penalty for up to 10 percent similarity in articles, theses, projects, etc. At level 1 a paper contains similarities above 10 percent and up to 40 percent.
At level 2 a paper contains similarities above 40 percent and up to 60 percent. At level 3 a paper contains similarities above 60 percent. However, the UGC declared that a zero-tolerance policy must be applied in core areas of research, and if plagiarism is found there, the disciplinary authority of the higher educational institution must apply the maximum penalty. For comparison, in the United States plagiarism is not a crime as such. However, there is robust copyright protection along with the notions of "breach of contract" (contract cheating [17]) and fraud, which allows stopping and punishing plagiarists in most cases. Yet, as the examples above show, neither legislative framework offers clear criteria for when and how to discern plagiarized sentences, passages, or documents from non-plagiarized ones. So, I can conclude that the current state of the art in the legal sphere does not make any significant impact on the actual development of scientific methods and commercial tools intended to combat plagiarism. Moreover, as mentioned before, some phenomena such as contract cheating muddy the waters even more. Contract cheating occurs when students turn in assignments they hired others, human or machine, to complete for them in order to receive academic credit. Since the advent of internet services, this type of academic fraud has become more prevalent globally, and it keeps growing. Many institutions switched to online exams during the worldwide COVID-19 pandemic [18], and in Ukraine the Russian invasion exacerbated this problem even further. From my own experience, plagiarism has turned out to be the most widespread underlying aspect of contract cheating in Ukraine. Very often automated paraphrasing tools [19] are used for this purpose, which adds a new dimension to the problem. Consider, for example, a hypothetical scenario in which a student uses such a tool to paraphrase content from file-sharing websites while citing the original source in a reference list.
On one hand, citing the original source in a reference list suggests that the student did not intend to cheat and present somebody else's research as their own; yet by most definitions of plagiarism it is plagiarism. This is even more so if the assignment is given in a non-native language. What if a student writes in their native tongue, translates the text into English, and then runs it through a paraphrasing tool? The result will be a text carrying some amount of stylistic clues pointing to plagiarism. And when one has little to no idea of what is inside a given plagiarism analysis tool, one will make errors during evaluation. However, a certain level of obscurity is important for any plagiarism detection tool with simple (or predictable) rules inside, since automatic paraphrasing tools will exploit any known weaknesses or data on the internal workings of plagiarism detection tools to make "better" plagiarized texts.

2.1.3. Plagiarism detection background
There is very little research in the field on how much overlap is enough to declare something plagiarism. For example, the work [20] uses the MapLemon corpus to tackle this problem quantitatively. This corpus contains English-language essays written by online participants who were asked to write and submit essays on very specific topics. Thanks to its very restrictive guidelines, the corpus gives a good picture of how one and the same thing can be expressed as text. However, MapLemon is very limited in scale, which greatly reduces its value. As a result, most works concentrate on building some kind of machine learning-based infrastructure and learning model parameters without trying to measure where the line between plagiarism and non-plagiarism actually lies, as in [21]. Yet any machine learning methodology at its best assigns weight coefficients to some explicitly or implicitly defined rules, and it is good if those rules are explicitly defined and analyzed.
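The quantitative question raised above presupposes an explicit, analyzable overlap measure. A minimal sketch of one such measure, word n-gram containment, is shown below; it is an illustration rather than any cited system's actual method, and the n-gram size, sample texts, and any threshold one would put on the score are assumptions.

```python
# Sketch: word n-gram containment between a suspicious text and a source.
# The 3-gram size and the toy sentences are illustrative assumptions.
def ngrams(text: str, n: int = 3) -> set:
    """Collect the set of word n-grams of a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def containment(suspicious: str, source: str, n: int = 3) -> float:
    """Share of the suspicious text's n-grams that also occur in the source."""
    sus = ngrams(suspicious, n)
    if not sus:
        return 0.0
    return len(sus & ngrams(source, n)) / len(sus)

source = "plagiarism detection is an open research problem in modern nlp"
copy = "plagiarism detection is an open research problem for many groups"
print(f"containment: {containment(copy, source):.2f}")
```

A score of 1.0 means every n-gram of the suspicious text appears in the source; where between 0 and 1 the line for "plagiarism" should be drawn is exactly the open question discussed above.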
There are some good datasets on plagiarism, including the MSRP corpus [22] and the PAN plagiarism corpus 2011 (PAN-PC-11) [23], which covers how one text can be "creatively" rewritten into another. However, they do not provide enough diversity to show all necessary variations of plagiarism for many tasks. So, many researchers use auto-generated and auto-obfuscated texts to mitigate the issue and obtain enough data. It is important to underline that I do not delve deeply into methods which use images and other graphical components of documents to discern fraud or plagiarism, as in [24]. I assume that this task is too complex to be reliably automated right now, at least until image comparison tools develop enough to understand image structure better. There are tools and means to detect text similarity and plagiarism in source code [25]. They have their own niche, since students often need to turn in their source code and it is useful to check it for appropriations. Such tools can also help improve programming performance as far as boilerplate code is concerned, since there is little meaningful difference between similar classes with very similar common behavior. Moreover, using auto-generated code, following the same approach, and maintaining the same standards is actually beneficial to programming performance in general. Yet I do not analyze such tools in this paper. This paper deals with natural language tools, since I assume the problem is both complex enough to be worth researching and yet manageable. Accordingly, and regardless of the means used to create the plagiarized text, plagiarism detection tasks are divided into the following main categories.
1. Sub-lexical
2. Lexical
3. Syntactic
4. Semantic
5. Stylometric
6. Structural
7. Citation
8.
Cross-language
The sub-lexical task deals with spam-derived [26] plagiarizing techniques in which some symbols in the analyzed text are intentionally replaced with similar-looking symbols; for example, "i.e." and "і.е." are actually two different strings of symbols from two different languages. In this case humans and automatic tools perceive different content in the same text: humans can see a coherent text, while a plagiarism detection system will see a (not-)coherent text with some inserts. The lexical task focuses on a document's lexical structure (as in [20]). N-grams or some kind of dictionary fingerprinting (up to the means used in search engines to collapse all similar documents into a single entry), clustering methods, and longest common subsequence are among the most popular lexical methods. Systems good at the first two task levels perform well on the copy-and-paste and mix-and-paste types of plagiarism. This is the main task for most commercial systems, especially if the system uses the Internet to search for potential sources. The syntactic task analyses and tracks positional syntactic changes [27] and can partially address minor paraphrases which are not semantic in nature. It is especially important for Ukrainian and other languages leaning toward the synthetic side of the linguistic spectrum, in contrast with English and other analytic-leaning languages. The semantic task analyses the meaning of a document by considering synonyms, antonyms, and semantic similarity/distance. Both SEO specialists and plagiarizers often use simple synonym substitution to make a semantically similar text with a different appearance. So, embeddings (vector semantics-based methods) are now common in systems solving the semantic task. While Latent Semantic Analysis is still in use, embeddings based on deep neural networks [28] are steadily taking over everywhere in the field of Natural Language Processing, and plagiarism detection is no exception to this process.
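The sub-lexical obfuscation described above can be caught by checking which writing systems the letters of each token belong to. The sketch below is a minimal illustration, not a production detector: the whitespace tokenizer and the script-from-Unicode-name heuristic are simplifying assumptions.

```python
# Sketch: flag tokens that mix Latin and Cyrillic letters, a common
# homoglyph obfuscation trick described in the sub-lexical task.
import unicodedata

def letter_scripts(token: str) -> set:
    """Return the set of scripts used by the letters of a token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # The script is the first word of the Unicode character name,
            # e.g. "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER I".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def suspicious_tokens(text: str) -> list:
    """Flag tokens whose letters come from more than one script."""
    return [tok for tok in text.split() if len(letter_scripts(tok)) > 1]

# "і" below is the Cyrillic look-alike of the Latin "i".
print(suspicious_tokens("this іs a plaіn example"))  # flags the mixed tokens
```

Normalizing such confusables before comparison lets humans and the detection system once again see the same text.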
For a language like Ukrainian, the semantic task is also quite important, since there is significant room for verb-to-noun and noun-to-verb transformation (for example, "будівництво" ("construction") and "будують" ("they are building") are interchangeable in many contexts, while they are different parts of speech and have different syntactic roles). The stylometric task approaches the document as a single entity with a single style. It is a complex approach that extensively uses tools from the lexical task in combination with syntactic distance measurement. It is a statistical method that analyses an author's style under the assumption that each author has a consistent style of writing. While it works fine for native speakers with consistent language habits, for non-native speakers the discrepancies observed in the produced text style often depend on the amount of effort and time poured into editing certain parts of the text. That is why I consider this task the least important of all. The structural task analyses how structural features such as keywords, headers, paragraphs, and references are presented, along with the distribution of the words in a document. Graph comparison approaches have been applied to the structural task [29]; however, due to the hardness of the underlying mathematical problem of subgraph isomorphism, they are not used very often. The citation task [30] compares the sets of source documents (and occasionally their order); it is a very fast and efficient tool for finding big-block text appropriations. The cross-language task was among the hardest tasks in this list, partially covered by solving the citation task. However, the development of machine translation made it easier. The later development of cross-language deep neural networks provided efficient embedding methods [31] which allowed efficient crossing of the gap between languages [32]. The development of Deep Neural Networks also allowed setting a new ambitious task: image plagiarism detection.
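In its simplest form, the citation task described above reduces to comparing the sets of cited sources of two documents. The sketch below uses Jaccard similarity for this; the DOI-like identifiers and the review threshold are illustrative assumptions, not part of any cited system.

```python
# Sketch: citation-based comparison via Jaccard similarity of reference sets.
def jaccard(refs_a: set, refs_b: set) -> float:
    """Jaccard similarity of two citation sets: |A & B| / |A | B|."""
    if not refs_a and not refs_b:
        return 0.0
    return len(refs_a & refs_b) / len(refs_a | refs_b)

# Hypothetical reference lists of two documents.
doc_a = {"doi:10.1/x", "doi:10.1/y", "doi:10.1/z"}
doc_b = {"doi:10.1/x", "doi:10.1/y", "doi:10.1/w"}

score = jaccard(doc_a, doc_b)
print(f"citation overlap: {score:.2f}")  # 2 shared of 4 total -> 0.50
if score >= 0.5:  # assumed threshold for illustration
    print("flag for manual review")
```

Because it ignores the text entirely, this comparison is language-independent, which is why the citation task partially covers the cross-language case.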
It is relatively new (the first paper available on Google Scholar dates back to 2012) and extremely underdeveloped.

2.1.4. Plagiarism analysis software types
The tools are divided into standalone and online. WCopyFind [33] is an example of a standalone program, while iThenticate [34] works online without the need to install any software. Based on data source usage there are three types of tools. Internal database tools such as CopyCatch [35] and WCopyFind detect plagiarism within a database. External database tools check similarity against available external sources on the Internet, such as EVE2 and EduTie [36]. And some tools, such as Turnitin [37] and iThenticate, use both internal and external databases for plagiarism detection. With respect to language, there are monolanguage, multilanguage (operating as monolanguage tools for a set of languages), and cross-language tools [32]. As of now, standalone internal database monolanguage systems are the most prolific, on both the commercial and the research side.

2.2. Neural networks in plagiarism detection
Nowadays more and more plagiarism detection systems rely on ML-based subsystems to learn implicitly the rules which define whether the passage under consideration is plagiarized or not. Powerful neural networks, along with significant advantages such as the ability to solve cross-language plagiarism detection, also create a drawback: it is effectively impossible to dig out why this or that passage was labeled as suspicious. Nevertheless, NN-based ML is the best method available now for plagiarism detection. Any plagiarism detection system must have a dataset. Since there are not enough good datasets on plagiarism, many researchers use auto-generated texts either to pad the available human-generated datasets or as entirely machine-generated datasets. So, the issue of machine-generated examples has been analyzed.
In [38] the effectiveness of six different word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text was evaluated. The most important part is that the best performing automatic classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. This is a useful result showing that, in spite of the explosive development of machine-driven plagiarizing systems, such systems still leave notable signs in the plagiarized texts, enabling specific ML training to combat the plagiarizers most often used by students. In [39] machine-paraphrased plagiarism was analyzed. The effectiveness of five pre-trained word embedding models was evaluated, combined with machine learning classifiers and state-of-the-art neural network language models. Preprints of research papers, graduation theses, and Wikipedia articles were paraphrased using different configurations of the tools SpinBot [40] and SpinnerChief [41]. The best performing technique in the paper, Longformer [42], achieved an average F1 score of 0.8099 (F1=0.9968 for the SpinBot and F1=0.7164 for the SpinnerChief cases), while human evaluators achieved F1=0.784 for the SpinBot and F1=0.656 for the SpinnerChief cases. The authors conclusively showed that this approach outperforms both humans and the usual methods implemented in commercial systems such as Turnitin. However, in [43] researchers reported that GloVe [44] can outperform BERT under certain conditions. It must be noted that the research is centered on the concept of "tortured phrases" that often appear from misused translation. For example, such phrases include "counterfeit consciousness", which is used instead of "artificial intelligence". This becomes more prominent if the plagiarism was generated by circular translation such as English-German-English. Usage of automatic translation tools also contributes to the probability that such tortured phrases appear.
The researchers used the cosine score to demonstrate this: they claim that GloVe embeddings produce a cosine score of 0.12 for tortured phrases and 0.3 for normal phrases, while BERT embeddings give 0.5 for tortured phrases and 0.55 for normal phrases. The researchers explained this as an excessive influence of context in the BERT model preventing the system from generalizing, unlike GloVe embeddings, which are context-free. Also, it must be noted that such weak results were obtained when the final part of the architecture, which makes the actual decision whether the sentence is potentially plagiarized, is simple enough. So, hunting for this kind of phrasing is a good source of supporting evidence of plagiarism, yet it must not replace more general signals detectable by more general tools. It can be concluded that automatic generation of plagiarized text produces more easily detectable texts no matter the method of generation: rephrasing or translation. So, it is better to use human-made or, at worst, machine-assisted and human-controlled datasets. It can be assumed that the most important part of neural network usage in plagiarism detection is the embedding that represents vector semantics. In [45] an experiment with different embeddings was described. The results show that the pre-trained BERT model offers the best results and outperforms GloVe and RoBERTa in the monolanguage task. This is half-expected, since BERT usually outperforms simpler (or simplified) embedding methods. The authors used index ranking as the metric, with BERT and RoBERTa offering rankings of 0.76 and 0.72, while GloVe+TF-IDF offered 0.57. It has to be noted that the more robust RoBERTa does not allow getting enough collateral drift to improve plagiarism detection specifically, as was expected after the results shown in [43]. In [46] cross-language plagiarism detection with contextualized word embeddings was analyzed. The evaluation experiments show that contextualized word embeddings are an appropriate approach that improves performance greatly.
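The cosine score used throughout these comparisons is the standard similarity between embedding vectors. A minimal sketch follows; the 4-dimensional vectors are invented stand-ins, since real GloVe or BERT embeddings have hundreds of dimensions, and the phrase labels are illustrative only.

```python
# Sketch: cosine similarity between two embedding vectors.
import math

def cosine(u: list, v: list) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: a "tortured" phrase drifts away from the
# embedding of the standard term it replaces.
standard = [0.9, 0.1, 0.3, 0.2]   # e.g. "artificial intelligence"
tortured = [0.2, 0.8, 0.1, 0.6]   # e.g. "counterfeit consciousness"

print(f"similarity: {cosine(standard, tortured):.2f}")
```

A large gap between the scores of tortured and normal phrases, as reported for GloVe above, is what makes the tortured-phrase signal separable.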
SBERT [47] is used to make embeddings of whole sentences, in contrast with [39] where Longformer was used. It must be mentioned that the method designed in [46] does not use any translation system. The tests performed have demonstrated that it works for different language pairs such as English-French, English-Spanish, English-Portuguese, and English-Russian. For the PAN-PC-12 Spanish partition F1 = 0.7938 and for the PAN-PC-12 German partition F1 = 0.778. The English-Russian comparison is very important, since the languages of the English-Russian pair are almost at opposite ends of the synthetic-analytic language scale, with drastically different syntactic structures on top of notably different semantics. In [48] another cross-language study was performed. The authors claimed an accuracy of 0.9701. While not as impressive as [46], the Arabic-English experimental results nevertheless showed that using deep neural networks with rich semantic features achieves encouraging results. In [49] an attempt at cross-language plagiarism detection for the English-Persian pair was made. Unlike most papers on cross-language plagiarism, this one spent significant effort on combating post-translation obfuscation. The percentages of text with different obfuscation methods are: no obfuscation 29%, mechanical obfuscation 29%, human paraphrasing 10%, summarization 8%, circular translation 10%, split 9%, merge 5%. For the Persian language the most efficient method among those analyzed was translation plus monolanguage plagiarism detection (the suspicious documents were translated with the Google Translate API from Persian into English). Using this approach they managed to achieve a score of F1=0.713. The performance of BERT in the plagiarism detection task shows that some kind of high-end context-sensitive embedding and some kind of complex final resolution mechanism are both absolutely necessary to achieve a good result in plagiarism detection.
Longformer and SBERT are not the only methods to process long strings. In [50] an architecture based on Long Short-Term Memory (LSTM) and an attention mechanism, called LSTM-AM-ABC, boosted by a population-based approach for parameter initialization, was proposed. The paper employs a population-based metaheuristic algorithm (Artificial Bee Colony) to solve the problem. The algorithm can find the initial values for model learning in the LSTM, the attention mechanism, and the feed-forward neural network simultaneously. On the MSRP dataset, compared to several other methods including Siamese CNN+LSTM [51] and CETE [52], the method showed the best performance, with an average score of F1=0.857. In [53], to model the "partial matching" between documents, a Partial Matching Convolutional Neural Network (PMCNN) was proposed for source retrieval. PMCNN exploits a sequential convolutional neural network to extract the plagiarism patterns of contiguous text segments. The experimental results on the PAN 2013 [54] and PAN 2014 [55] plagiarism source retrieval corpora have shown that PMCNN boosts the performance of source retrieval significantly compared to a ranking SVM-based approach [56]. The general performance of the NN on the PAN 2013 and PAN 2014 corpora gives F1=0.6171 and F1=0.5474 respectively. The paper once more confirms that neural networks outperform other methods. The important contribution, however, consists in the usage of a CNN, since this type of feed-forward network has the best chance of being analyzed for the purpose of extracting knowledge, unlike LSTMs. In [57] the very ambitious task of plagiarism detection in the image-based medium was undertaken. The reasoning behind the research is sound: it takes much more effort to plagiarize images to the same degree of unrecognizability compared to text, thus making such a system a valuable addition to any large-scale plagiarism detection system, especially a commercial one.
It is important to note that there are very few papers on the subject: I was able to find only 52 papers on Google Scholar, 32 of which were published since 2018. The paper [57] proposed a system that can potentially cover the usual flaws of image plagiarism detection systems. The research is focused on flowcharts, since it is the most vulnerable image type; however, the approach is suitable for any kind of image, as was shown in the experimental section. Alas, while the system can detect unedited and flipped images with high accuracy, the accuracy goes down drastically if operations such as rotation, greyscaling, and cropping are applied. A rotated image can be detected with 80% accuracy, while greyscaling reduces the accuracy to 20%. The most telling problem is the drop in accuracy for cropped images to 60%. This defeats the idea of converting the flowchart into a directed graph. Hypothetically, such an approach should enable the system to detect the shape of the flowcharts under any positioning changes, as long as the graph stays the same. Yet for some reason it failed, which indicates that one needs another approach to the task. In my opinion, any such system can be used as an auxiliary to a text-based one, but it will neither attain the same efficiency nor serve as the main tool. The problems of any image plagiarism detection system are exacerbated by the rapid development of neural network-based image generation systems such as DALL-E [58], which can create an image from whole cloth in any style out of a text description. A plagiarizer can describe an image (especially one such as a flowchart) and feed the description into an image generation engine in order to receive an original-looking yet totally plagiarized image.

3. Conclusions
The field of plagiarism detection is undergoing rapid development in some areas while staying stagnant in others. The most complex task in direct text plagiarism detection used to be the cross-language task.
Now powerful cross-language tools to solve that task have been developed and will continue to develop in the foreseeable future. At present the most complex task is image plagiarism detection. However, the legislative part of the issue is lacking and most probably will stay lacking as long as researchers are unable to build tools able to explain why and how this or that passage was labeled as suspicious. This is a task as hard as any task of extracting knowledge from a neural network, and with the development of Deep Neural Networks the problem has only been exacerbated. Also, I do not expect the problem of interaction between scientists and legislators to be resolved by an agreement among designers of commercial plagiarism detection software creating a de facto industry standard which could serve as agreeable common ground. Moreover, there is no gold standard for what is plagiarism and what is not, neither on the national level nor on the international one, so the practice of relying on automated tools to evaluate any paper or student assignment will still produce a significant amount of friction between students and teachers. At least with research papers there is the tried and tested peer review, which, while slow and not perfect, solves the issues of plagiarism detection in most cases.

4. Acknowledgements
The author would like to acknowledge the following people for their contributions to the research: prof. Anisimov A.V. from the faculty of Computer Sciences and Cybernetics, Taras Shevchenko National University of Kyiv, for useful suggestions on the nature of natural language texts and general support; staff members of dpt. 165 of IRTC IT&S, Kyiv, for libraries and support provided during this research.

5. References
[1] Introducing Grammarly's New Tone Rewrite Suggestions. URL: https://www.grammarly.com/blog/tone-rewrite-suggestions
[2] T. Brown, et al., Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020) 1877-1901.
[3] O. O. Marchenko, O. S. Radyvonenko, T. S. Ignatova, et al., Improving Text Generation Through Introducing Coherence Metrics, Cybernetics and Systems Analysis 56 (2020) 13–21. doi:10.1007/s10559-020-00216-x
[4] T. Bretag, S. Mahmud, A model for determining student plagiarism: Electronic detection and academic judgment. Journal of University Teaching & Learning Practice 6 (2009). doi:10.53761/1.6.1.6
[5] Google Scholar. URL: https://scholar.google.com/
[6] K.A. Gamage, E.K.D. Silva, N. Gunawardhana, Online delivery and assessment during COVID-19: Safeguarding academic integrity. Education Sciences 10(11) (2020) 301.
[7] V. Luniachek, et al., "Academic integrity in higher education of Ukraine: current state and call for action." Education Research International (2020): 1-8.
[8] Merriam-Webster, "Dictionary by Merriam-Webster: America's most-trusted online dictionary," URL: https://www.merriamwebster.com/
[9] Cambridge Dictionary (online), "PLAGIARISM | meaning in the Cambridge English Dictionary," URL: https://dictionary.cambridge.org/dictionary/english/plagiarism
[10] L. Bornmann, "Research misconduct—definitions, manifestations and extent." Publications 1.3 (2013): 87-98.
[11] H. Sharma, S. Verma, Insight into modern-day plagiarism: The science of pseudo research. Tzu-Chi Medical Journal 32(3) (2020) 240.
[12] D. Weber-Wulff, False feathers: A perspective on academic plagiarism, Springer-Verlag, Berlin, 2014.
[13] M. Roig, "Plagiarism and self-plagiarism: What every author should know." Biochemia Medica 20.3 (2010): 295-300.
[14] R. M. Ghiațău, L. Mâță, University Teachers Plagiarism-A Preliminary Review of Research. BRAIN. Broad Research in Artificial Intelligence and Neuroscience 10 (2019) 22-32.
[15] Ministry of Education and Sciences of Ukraine, Order 40, 12.01.2017. URL: https://zakon.rada.gov.ua/laws/show/z0155-17#Text
[16] UGC Policy on Plagiarism. 2017.
https://www.ugc.ac.in/pdfnews/8864815_UGC-Public-Notice- on-Draft-UGC-Regulations,-2017.pdf [17] K. Ahsan, S. Akbar, B. Kam, Contract cheating in higher education: a systematic literature review and future research agenda. Assessment & Evaluation in Higher Education. 2022, 47(4) 523-39. 109 [18] S. E. Eaton, K. L. Turner, Exploring academic integrity and mental health during COVID-19: Rapid review. Journal of Contemporary Education Theory & Research (JCETR), 2020, 4(2), 35- 41. [19] J. Roe, M. Perkins, What are Automated Paraphrasing Tools and how do we address them? A review of a growing threat to academic integrity. Int J Educ Integr 18, 15 (2022). https://doi.org/10.1007/s40979-022-00109-w [20] P. Juola, "How much overlap means plagiarism? A controlled test corpus." Concurr. Sess 12 (2022): 13-14. [21] V. Vrublevskyi, O. Marchenko, "Development and Analysis of a Sentence Semantics Representation Model." Cybernetics and Systems Analysis 58.1 (2022): 16-23. [22] B. Dolan, C. Quirk, C. Brockett Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. Proc. 20th International Conference on Computational Linguistics (COLING 2004). (23–27 August 2004, Geneva, Switzerland). Geneva, 2004. P. 350–356. URL: https://aclanthology.org/C04-1051 [23] The PAN plagiarism corpus 2011 URL: https://webis.de/data/pan-pc-11.html [24] M.A.G. van der Heyden, The 1-h fraud detection challenge. Naunyn-Schmiedeberg's Arch Pharmacol 394, (2021) 1633–1640. https://doi.org/10.1007/s00210-021-02120-3 [25] M. H. Ismail, M. M. Lakulu, A Critical Review on Recent Proposed Automated Programming Assessment Tool. Turk. J. Comput. Math. Educ, 12, (2021), 884-894. [26] S. A. Rojas-Galeano, "Revealing non-alphabetical guises of spam-trigger vocables." Dyna 80.182 (2013): 50-57. [27] K. Vani, G. Deepa, "Text plagiarism classification using syntax based linguistic features." Expert Systems with Applications 88 (2017): 448-464. [28] Y. 
Wang, et al., "A comparison of word embeddings for the biomedical natural language processing." Journal of biomedical informatics 87 (2018): 12-20. [29] M. Franco-Salvador, et al., "Cross-language plagiarism detection over continuous-space-and knowledge graph-based representations of language." Knowledge-based systems 111 (2016), 87- 99. [30] B. Gipp, J. BEEL, Citation based plagiarism detection: a new approach to identify plagiarized work language independently. In: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia. (2010), 273-274. [31] J. Devlin, et al., "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). [32] T. Pires, E. Schlinger, D. Garrette, "How multilingual is multilingual BERT?." arXiv preprint arXiv:1906.01502 (2019) [33] Wcopyfind URL: https://plagiarism.bloomfieldmedia.com/software/wcopyfind/ [34] iThenticate URL: https://www.ithenticate.com/ [35] CopyCatch URL: https://www.elute.io/copycatch [36] T. Lancaster and F. Culwin, “Classifications of plagiarism detection engines,” Innov. Teach. Learn. Inf. Comput. Sci., vol. 4, no. 2, (2005) 1–16 [37] TurnItIn URL: https://www.turnitin.com/ [38] Foltýnek, Tomáš, et al., "Detecting machine-obfuscated plagiarism." International Conference on Information. Springer, Cham, (2020) 816-827 [39] J. P. Wahle, et al. Identifying machine-paraphrased plagiarism. In: International Conference on Information. Springer, Cham, (2022). 393-413 [40] K. Dey, R. Shrivastava, S. Kaushik, A Paraphrase and Semantic Similarity Detection System for User Generated Short-Text Content on Microblogs. In: Proceedings International Conference on Computational Linguistics (Coling), (2016), 2880–2890 [41] T. Foltýnek, N. Meuschke, B. Gipp, Academic Plagiarism Detection: A Systematic Literature Review. 
ACM Computing Surveys 52(6), (2019), 112:1–112:42 https://doi.org/10.1145/3345317F [42] I. Beltagy, M.E. Peters, and A. Cohan Longformer: The Long-Document Transformer. arXiv:2004.05150 (2020) 110 [43] P. Lay, M. Lentschat, C. Labbé, Investigating the detection of Tortured Phrases in Scientific Literature. In: Proceedings of the Third Workshop on Scholarly Document Processing. (2022) 32-36. [44] J. Pennington, R. Socher, and C. Manning. "Stanford glove: Global vectors for word representation." (2017). [45] R. Rosu et al., "NLP based Deep Learning Approach for Plagiarism Detection." RoCHI- International Conference on Human-Computer Interaction, Romania. 2021 [46] D. D. A. Vaz, Cross language plagiarism detection with contextualized word embeddings, (2021) [47] N. Reimers, I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT- Networks." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics (2019) 671-688 [48] S. Alzahrani, H. Aljuaid, Identifying cross-lingual plagiarism using rich semantic features and deep neural networks: A study on Arabic-English plagiarism cases. Journal of King Saud University-Computer and Information Sciences, (2020), https://doi.org/10.1016/j.jksuci.2020.04.009 [49] H. Asghari, et al., On the use of word embedding for cross language plagiarism detection. Intelligent Data Analysis. 23(3) (2019). 661-680. https://doi.org/10.3233/IDA-183985 [50] S.V. Moravvej et al., An LSTM-based plagiarism detection via attention mechanism and a population-based approach for pre-training parameters with imbalanced classes. In: International Conference on Neural Information Processing. Springer, Cham, (2021). 690-701. [51] M.T.R.Laskar, X. Huang, and E. Hoque. Contextualized embeddings based transformer encoder for sentence similarity modeling in answer selection task. in Proceedings of The 12th Language Resources and Evaluation Conference. (2020). [52] E. L. 
Pontes, et al., "Predicting the semantic textual similarity with siamese CNN and LSTM." arXiv preprint arXiv:1810.10641 (2018). [53] L. Kong, et al., "A Partial Matching Convolution Neural Network for Source Retrieval of Plagiarism Detection." IEICE TRANSACTIONS on Information and Systems 104.6 (2021): 915-918. [54] M. Potthast, et al., “Overview of the 5th international competition on plagiarism detection,” Proc. CLEF 2013 Evaluation Labs and Workshop, Valencia, Spain, (2013) 301–331 [55] M. Potthast, et al., “Overview of the 6th international competition on plagiarism detection,” Proc. CLEF 2014 Evaluation Labs and Workshop, Sheffield, United Kingdom, (2014) 845–876 [56] L. Kong, et al., "A ranking approach to source retrieval of plagiarism detection." IEICE TRANSACTIONS on Information and Systems 100.1 (2017): 203-205 [57] A. S. B. Ibrahin, O. O. Khalifa, D. E. M. Ahmed, Plagiarism Detection of Images. In 2020 IEEE Student Conference on Research and Development (SCOReD) IEEE, (2020) 183-188 [58] DALL·E: Creating Images from Text URL: https://openai.com/blog/dall-e/ 111