Style Breach Detection with Neural Sentence Embeddings Notebook for PAN at CLEF 2017

Kamil Safin (safin@ap-team.ru), Rita Kuznetsova (kuznetsova@ap-team.ru)
Moscow Institute of Physics and Technology; Antiplagiat CJSC

This paper investigates a method for the style breach detection task. We developed a method based on mapping sentences into a high-dimensional vector space, where each sentence vector depends on the previous and next sentences. As the main architecture for this mapping we use a pre-trained encoder-decoder model. We then use these vectors to construct an author style function and to detect outliers. The method was tested on the PAN-2017 collection for the style breach detection task.

Introduction

Identifying different authors within a single document is an open problem in natural language processing. Several tasks related to this problem have appeared in the PAN competition:

1. Intrinsic plagiarism detection [10,14,7]: given a suspicious document, it is assumed that one main author wrote at least 70% of the text, while up to 30% may be written by other authors. The task is to determine whether the document is written by a single author or contains fragments by other authors. Unlike the external plagiarism problem, the reference collection is unknown [16].
2. Author diarization [13]: given a document written by n authors, with no main author. The task is to identify exactly n authors in the document, where the number n may be known or unknown.

Most algorithms follow this scheme:

1. divide the text into blocks according to a segmentation scheme (e.g. sentences, n-grams, overlapping blocks),
2. map each block to a feature space (e.g. n-gram frequency [1,12], punctuation, part-of-speech tag counts [5]) and combine the features into an author style function (character 3-gram frequencies, n-gram classes (i.e. the inverted frequencies), normalized word frequency class),
3. find critical values of the author style function to detect plagiarized blocks.

The author diarization algorithms [4] use segmentation of classifier statistics if the number of authors is known, and a clustering approach if the number of authors is unknown.

The PAN-2017 [9] competition introduced a modified problem statement, style breach detection [15]: given a document, determine whether it is multi-authored and, if so, find the borders where authors switch. For this task we propose an approach based on neural phrase embeddings. First, we split a document into sentences and map each sentence into a high-dimensional vector space using a pre-trained encoder-decoder model, the skip-thoughts model from [3]. Each sentence vector depends on the sentences before and after it. We then construct the similarity matrix between all sentences in the document and detect outliers.

The quality of the model was measured by the WindowDiff [6] and WinP, WinR, WinF [11] metrics. All experiments were carried out on TIRA [8].

Style Breach Detection

Let $D$ denote a collection of text documents. Each document $d \in D$ is written by an unknown number of authors. The task is to find the borders where authors switch. A document may contain anywhere from zero to arbitrarily many switches. Switches of authorship may occur only at the end of a sentence, not within one. We formulate style breach detection as the problem of finding outlier sentences. A text document $d \in D$ consists of sentences: $d = \bigcup_{i=1}^{N} s_i$, where $N$ is the number of sentences in the text. Each sentence $s_i$ is vectorized using the pre-trained skip-thoughts model: $s_i \to \mathbf{s}_i$. Then a statistic $\mathrm{stat}(\mathbf{s}_i)$ is built for each sentence, and the goal is to find the sentence vectors whose statistic exceeds a threshold $\delta$: $\mathrm{stat}(\mathbf{s}_i) > \delta \Rightarrow s_i$ is an outlier.

Experiment

Quality criteria

To evaluate the predicted style breaches two metrics were used:

- The WindowDiff metric was proposed for general text segmentation evaluation. It gives an error rate (between 0 and 1, where 0 indicates a perfect prediction) for predicted borders, penalizing near-misses less than complete misses or extra borders. It is computed as follows:

$$\mathrm{WindowDiff}(ref, hyp) = \frac{1}{N-k} \sum_{i=1}^{N-k} \Big( \big| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \big| > 0 \Big),$$

where $b(i, j)$ is the number of boundaries between positions $i$ and $j$ in the text, $N$ is the number of sentences in the text, $k$ is the window size, and $ref$ and $hyp$ are the reference and hypothesized segmentations.
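The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the official PAN evaluator. It assumes each segmentation is encoded as a 0/1 list in which entry `i` equals 1 when a border follows sentence `i`; that encoding and the function name `window_diff` are assumptions introduced here.

```python
def window_diff(ref, hyp, k):
    """WindowDiff error rate for two boundary-indicator sequences.

    Assumed encoding: ref[i] == 1 means a border follows sentence i.
    """
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    n = len(ref)
    if not 0 < k < n:
        raise ValueError("window size k must satisfy 0 < k < N")
    errors = 0
    for i in range(n - k):
        # b(., .) is the number of borders inside the window of width k;
        # a window counts as an error when the two counts differ.
        if sum(ref[i:i + k]) != sum(hyp[i:i + k]):
            errors += 1
    return errors / (n - k)
```

A prediction that places a border one sentence too early disagrees with the reference in only some of the windows that cover it, so it is penalized less than a border that is missed entirely.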

- A more recent adaptation of WindowDiff is the WinPR metric. It extends WindowDiff by computing the common information retrieval measures precision (WinP) and recall (WinR), and thus allows a more detailed, qualitative assessment of the prediction.

$$\begin{aligned}
\text{True Positives} = TP &= \sum_{i=1-k}^{N} \min(R_{i,i+k}, C_{i,i+k}), \\
\text{True Negatives} = TN &= -k(k-1) + \sum_{i=1-k}^{N} \big(k - \max(R_{i,i+k}, C_{i,i+k})\big), \\
\text{False Positives} = FP &= \sum_{i=1-k}^{N} \max(0, C_{i,i+k} - R_{i,i+k}), \\
\text{False Negatives} = FN &= \sum_{i=1-k}^{N} \max(0, R_{i,i+k} - C_{i,i+k}),
\end{aligned}$$

where $R$ and $C$ represent the number of boundaries from the reference and computed segmentations, respectively, in the $i$-th window, up to a maximum of $k$; $N$ is the number of content units and $k$ is the window size.

And WinP, WinR, WinF are computed as:

$$\mathrm{WinP} = \frac{TP}{TP + FP}, \qquad \mathrm{WinR} = \frac{TP}{TP + FN}, \qquad \mathrm{WinF} = \frac{2 \cdot \mathrm{WinP} \cdot \mathrm{WinR}}{\mathrm{WinP} + \mathrm{WinR}}.$$
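The window counts and the derived measures can be sketched as follows. This is one possible reading of the sums, not the official evaluator. It assumes the segmentations are 0/1 indicators over potential boundary positions, padded with `k - 1` empty positions on each side so that every real position is covered by exactly `k` windows. The function name `win_pr` and the convention of returning 0.0 for an undefined ratio are assumptions made for this illustration.

```python
def win_pr(ref, hyp, k):
    """Return (WinP, WinR, WinF) for two boundary-indicator sequences."""
    if len(ref) != len(hyp):
        raise ValueError("segmentations must have the same length")
    if k < 1:
        raise ValueError("window size k must be positive")
    pad = [0] * (k - 1)
    r = pad + list(ref) + pad
    c = pad + list(hyp) + pad
    tp = fp = fn = 0
    tn = -k * (k - 1)  # correction term for the padded positions
    for i in range(len(r) - k + 1):
        r_win = sum(r[i:i + k])  # reference borders in window i
        c_win = sum(c[i:i + k])  # computed borders in window i
        tp += min(r_win, c_win)
        tn += k - max(r_win, c_win)
        fp += max(0, c_win - r_win)
        fn += max(0, r_win - c_win)
    win_p = tp / (tp + fp) if tp + fp else 0.0
    win_r = tp / (tp + fn) if tp + fn else 0.0
    win_f = 2 * win_p * win_r / (win_p + win_r) if win_p + win_r else 0.0
    return win_p, win_r, win_f
```

Under this padding convention each window contributes exactly `k` to TP + TN + FP + FN, so for `P` potential boundary positions the four counts sum to `k * P` once the correction term is applied.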

Feature construction

The raw text document $d$ is split into sentences $s_i$ using the standard NLTK sentence tokenizer [2]. Each sentence is vectorized by the pre-trained skip-thoughts model (footnote 1). The skip-thoughts model belongs to the class of encoder-decoder models: the encoder maps word embeddings to a sentence vector, and the decoder generates the surrounding sentences. Skip-thought vectors consist of two separate models. One is a unidirectional encoder with 2400 dimensions, referred to as uni-skip. The other is a bidirectional model with 2400 dimensions, containing forward and backward encoders of 1200 dimensions each; it is referred to as bi-skip.

Encoder. Let $w_i^1, \dots, w_i^N$ be the words in sentence $s_i$, where $N$ is the number of words in the sentence. At each step $t$, the encoder generates a hidden state $h_i^t$, which can be interpreted as the representation of the sequence $w_i^1, \dots, w_i^t$. The final hidden state $h_i^N := \mathbf{s}_i$ is the vector representation of the full sentence $s_i$.

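$$\begin{aligned}
z_t &= \sigma(W_{zx} x_t + W_{zh} h_{t-1}), \\
r_t &= \sigma(W_{rx} x_t + W_{rh} h_{t-1}), \\
\bar{h}_t &= \tanh\big(W_x x_t + W_h (r_t \bullet h_{t-1})\big), \\
h_t &= (1 - z_t) \bullet h_{t-1} + z_t \bullet \bar{h}_t,
\end{aligned} \tag{1}$$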

where $(W_{zx}, W_{zh}, W_{rx}, W_{rh}, W_x, W_h)$ are the parameters of the GRU-type encoder, $x_t$ is the vector representation of word $w^t$, and $(\bullet)$ denotes a component-wise product.
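A single recurrence step of equation (1) can be sketched in NumPy as shown below. This illustrates the update rule only and is not the pre-trained skip-thoughts encoder. The parameter dictionary, the toy dimensions, and the helper `random_params` are hypothetical names introduced for the example, and the initial hidden state is assumed to be zero.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def gru_step(x_t, h_prev, p):
    """One step of equation (1): returns h_t from x_t and h_{t-1}."""
    z_t = sigmoid(p["W_zx"] @ x_t + p["W_zh"] @ h_prev)  # update gate
    r_t = sigmoid(p["W_rx"] @ x_t + p["W_rh"] @ h_prev)  # reset gate
    h_bar = np.tanh(p["W_x"] @ x_t + p["W_h"] @ (r_t * h_prev))
    return (1.0 - z_t) * h_prev + z_t * h_bar


def encode_sentence(word_vectors, p):
    """Run the recurrence over all words; the last state is the sentence vector."""
    h = np.zeros(p["W_zh"].shape[0])  # assumed zero initial state
    for x_t in word_vectors:
        h = gru_step(x_t, h, p)
    return h


def random_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights, for illustration only."""
    rng = np.random.default_rng(seed)
    p = {}
    for name in ("W_zx", "W_rx", "W_x"):
        p[name] = rng.normal(scale=0.1, size=(hidden_size, input_size))
    for name in ("W_zh", "W_rh", "W_h"):
        p[name] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    return p
```

For example, `encode_sentence(np.ones((5, 4)), random_params(4, 3))` returns a 3-dimensional vector for a toy five-word sentence with 4-dimensional word embeddings.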

Decoder. The decoder is a model conditioned on the encoder output $\mathbf{s}_i$. The decoder is similar to the encoder, but is applied to the next sentence $s_{i+1}$ and the previous sentence $s_{i-1}$.

Objective. Given a tuple $(s_{i-1}, s_i, s_{i+1})$, the objective optimized is the sum of the log-probabilities for the forward and backward sentences conditioned on the encoder representation.
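Written out, the objective from [3] takes the following form. The index notation is an assumption chosen to match the encoder description above: $w_{i+1}^t$ and $w_{i-1}^t$ denote the $t$-th word of the next and previous sentence, and $w^{<t}$ denotes the words preceding position $t$.

```latex
\sum_{t} \log P\big(w_{i+1}^{t} \mid w_{i+1}^{<t}, \mathbf{s}_i\big)
  + \sum_{t} \log P\big(w_{i-1}^{t} \mid w_{i-1}^{<t}, \mathbf{s}_i\big)
```

The total training objective is this quantity summed over all such tuples in the training corpus.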

Consider the dataset $S = \{s_i\}$ consisting of sentences $s_i = (x_1, \dots, x_n)$, where $x_k \in X$ is a word embedding. Our goal is to learn representations for variable-sized phrases in an unsupervised training regime. We use the encoder-decoder model (GRU-GRU) described in [3].

To build the statistic, we construct the pairwise distance matrix $M = \{m_{ij}\}_{i,j=1}^{N}$, where $N$ is the number of sentences in the text. For each pair of sentences $(s_i, s_j)$ the cosine distance is computed:

$$m_{ij} = \cos(\mathbf{s}_i, \mathbf{s}_j).$$

The statistic for each sentence is the mean cosine distance to all other sentences in the text:

$$\mathrm{stat}(\mathbf{s}_i) = \frac{1}{N} \sum_{j \neq i} \cos(\mathbf{s}_i, \mathbf{s}_j).$$

To detect the borders where authors switch, we adopt the hypothesis that sentences around the borders differ from the other sentences in the text. Outliers are defined as sentences whose statistic exceeds the threshold $\delta$: $\mathrm{stat}(\mathbf{s}_i) > \delta \Rightarrow s_i$ is an outlier.
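The statistic and the thresholding rule can be sketched in NumPy as follows. This is a minimal illustration under one stated assumption: the cosine distance is taken to be one minus the cosine similarity, so that a larger statistic means a sentence is less like the rest of the text. The function names are introduced here for the example, and any array of sentence vectors can stand in for the skip-thoughts output.

```python
import numpy as np


def cosine_distance_matrix(vectors):
    """Pairwise matrix M, assuming m_ij = 1 - cosine similarity."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.maximum(norms, 1e-12)  # guard against zero vectors
    return 1.0 - unit @ unit.T


def sentence_statistic(dist):
    """stat(s_i): sum of distances to all other sentences, divided by N."""
    n = dist.shape[0]
    return (dist.sum(axis=1) - np.diag(dist)) / n


def detect_outliers(vectors, delta):
    """Indices of sentences whose statistic exceeds the threshold delta."""
    stat = sentence_statistic(cosine_distance_matrix(vectors))
    return np.flatnonzero(stat > delta)
```

With three identical vectors and one orthogonal vector, only the orthogonal one has a large mean distance to the rest, so it is the only index returned for a threshold between the two groups.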

An example of the algorithm's output is shown in Figures 1 and 2. The green line denotes the threshold value, and the red lines mark the detected outlier sentences. Blue dots on the pairwise distance matrix denote the true borders.

Parameters Tuning

The threshold $\delta$ was tuned to maximize the final performance measure, WinF. In addition, to compress the model and to analyze the properties of skip-thoughts vectors, different parts of these vectors were used to calculate the statistic, specifically:

- whole 4800-dimensional skip-thoughts vectors,
- 2400-dimensional uni-skip vectors,
- 2400-dimensional bi-skip vectors.
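Threshold tuning of this kind can be sketched as a plain grid search. The sketch rests on several assumptions that are not taken from the paper: the candidate grid, the use of outlier flags directly as border indicators, and the scoring function passed as `score_fn`. The toy `position_f1` below is only a stand-in so that the example runs on its own; in the setting described here the score would be WinF.

```python
def position_f1(ref, hyp):
    """Toy stand-in score: F1 over exact border positions (not WinF)."""
    tp = sum(1 for r, h in zip(ref, hyp) if r and h)
    fp = sum(1 for r, h in zip(ref, hyp) if h and not r)
    fn = sum(1 for r, h in zip(ref, hyp) if r and not h)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(stats_per_doc, refs_per_doc, candidates, score_fn):
    """Return the (delta, mean score) pair with the best mean score."""
    best_delta, best_score = None, float("-inf")
    for delta in candidates:
        scores = []
        for stats, ref in zip(stats_per_doc, refs_per_doc):
            # Assumption: an outlier sentence is predicted as a border.
            hyp = [1 if s > delta else 0 for s in stats]
            scores.append(score_fn(ref, hyp))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_delta, best_score = delta, mean_score
    return best_delta, best_score
```

Running the same search once per vector variant (whole, uni-skip, bi-skip) would give one tuned threshold and one score for each variant, which can then be compared.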

The results of parameter tuning are shown in Figures 3-6.

Results

The proposed algorithm was tested on the PAN-2017 style breach detection training and test datasets. The results are shown in Table 1.

Conclusion

We proposed an algorithm for the style breach detection task. The method splits a text into sentences, vectorizes them, and then builds a statistic over the sentence vectors to detect outlier sentences.

The method was submitted to the PAN-2017 competition for the style breach detection task. The model achieved a WinF measure of 0.28 on the test dataset.

Figure 1: Example of pairwise distance matrix and statistic for sentences

Figure 2: Pairwise distance matrix and statistic for sentences with threshold (green) and detected outliers (red)

Figure 3: Skip-vectors model parameters tuning.

Figure 4: Uni-vectors model parameters tuning.

Figure 5: Bi-vectors model parameters tuning.

Figure 6: Uni-vectors model precise parameters tuning.

Table 1: Results on the PAN'17 style breach detection training and test datasets

| Dataset | WindowDiff | WinP | WinR | WinF |
|---|---|---|---|---|
| training dataset | 0.62 | 0.27 | 0.61 | 0.24 |
| test dataset | 0.53 | 0.37 | 0.54 | 0.28 |

Footnote 1: https://github.com/ryankiros/skip-thoughts

References

[1] I. Bensalem, P. Rosso, S. Chikhi. Intrinsic plagiarism detection using n-gram classes. EMNLP, 2014.
[2] S. Bird. NLTK: the natural language toolkit. Proceedings of the COLING/ACL on Interactive presentation sessions, 2006.
[3] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, S. Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.
[4] M. Kuznetsov, A. Motrenko, R. Kuznetsova, V. Strijov. Methods for intrinsic plagiarism detection and author diarization. Notebook for PAN at CLEF 2016, 2016.
[5] G. Oberreuter, G. L'Huillier, S. A. Ríos, J. D. Velásquez. Approaches for intrinsic and external plagiarism detection. Proceedings of the PAN, 2011.
[6] L. Pevzner, M. A. Hearst. A critique and improvement of an evaluation metric for text segmentation. Computational Linguistics, 2002.
[7] M. Potthast, T. Gollub, M. Hagen, J. Kiesel, M. Michel, A. Oberländer, M. Tippmann, A. Barrón-Cedeño, P. Gupta, P. Rosso, B. Stein. Overview of the 4th international competition on plagiarism detection. CLEF (Online Working Notes/Labs/Workshop), Citeseer, 2012.
[8] M. Potthast, T. Gollub, F. Rangel, P. Rosso, E. Stamatatos, B. Stein. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In: E. Kanoulas, M. Lupu, P. Clough, M. Sanderson, M. Hall, A. Hanbury, E. Toms (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York, Sep 2014.
[9] M. Potthast, F. Rangel, M. Tschuggnall, E. Stamatatos, P. Rosso, B. Stein. Overview of PAN'17: Author Identification, Author Profiling, and Author Obfuscation. In: G. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, N. Ferro (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17). Springer, Berlin Heidelberg New York, Sep 2017.
[10] M. Potthast, B. Stein, A. Barrón-Cedeño, P. Rosso. An evaluation framework for plagiarism detection. Proceedings of the 23rd international conference on computational linguistics, 2010.
[11] M. Scaiano, D. Inkpen. Getting more from segmentation evaluation. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012.
[12] E. Stamatatos. Intrinsic plagiarism detection using character n-gram profiles. 2009.
[13] E. Stamatatos, M. Tschuggnall, B. Verhoeven, W. Daelemans, G. Specht, B. Stein, M. Potthast. Clustering by authorship within and across documents. CEUR Workshop Proceedings, 2016.
[14] B. Stein, L. Barron Cedeno, A. Eiselt, M. Potthast, P. Rosso. Overview of the 3rd international competition on plagiarism detection. CEUR Workshop Proceedings, 2011.
[15] M. Tschuggnall, E. Stamatatos, B. Verhoeven, W. Daelemans, G. Specht, B. Stein, M. Potthast. In: L. Cappellato, N. Ferro, L. Goeuriot, T. Mandl (eds.) Working Notes Papers of the CLEF 2017 Evaluation Labs.
[16] M. Zechner, M. Muhr, R. Kern, M. Granitzer. External and intrinsic plagiarism detection using vector space models. Proc. SEPLN 32, 2009.