<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Daniel</forename><surname>Karaś</surname></persName>
							<email>dkaras@opi.org.pl</email>
							<affiliation key="aff0">
								<orgName type="institution">National Information Processing Institute</orgName>
								<address>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martyna</forename><surname>Śpiewak</surname></persName>
							<email>mspiewak@opi.org.pl</email>
							<affiliation key="aff0">
								<orgName type="institution">National Information Processing Institute</orgName>
								<address>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Piotr</forename><surname>Sobecki</surname></persName>
							<email>psobecki@opi.org.pl</email>
							<affiliation key="aff0">
								<orgName type="institution">National Information Processing Institute</orgName>
								<address>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CAEA37D3E669E378F80A6C4BC48FEFC5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper, we propose methods for the author identification task, which is divided into author clustering and style breach detection. Our solution to the first problem consists of locality-sensitive hashing based clustering of real-valued vectors, which are mixtures of stylometric features and bags of n-grams. For the second problem, we propose a statistical approach based on several tf-idf features that characterize documents. We determine the style breaches by applying the Wilcoxon signed-rank test to these features.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Author Clustering</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Introduction</head><p>The Author Clustering task consists of two distinct problems: author clustering and authorship link ranking. Solving the first problem means assigning each of the m given documents to one of k clusters, where k is unknown and has to be approximated, and each of the k clusters corresponds to a single author. Authorship link ranking, on the other hand, can be understood as assigning intra-cluster confidence scores to document pairs, where a higher score indicates greater similarity between documents.</p><p>Both problems have to be solved for multiple collections of up to 50 documents. An additional difficulty lies in the fact that the document batches were created in three different languages: English, Dutch, and Greek. This makes it much harder to apply typical language-dependent resources such as Word2Vec <ref type="bibr" target="#b2">[3]</ref> or WordNet <ref type="bibr" target="#b3">[4]</ref>, since such resources are not readily available for languages other than English. At its core, our solution to the Author Clustering task consists of two main components: locality-sensitive hashing (LSH) and stylometric measures that are not language-specific.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Locality-sensitive hashing</head><p>The goal of locality-sensitive hashing (LSH) is to group items into "buckets" by approximating the similarities between those items. This family of algorithms is widely used in tasks such as clustering and near-duplicate detection.</p><p>There are multiple LSH algorithms. During our research we tested two of them: MinHash <ref type="bibr" target="#b8">[9]</ref> and SuperBit <ref type="bibr" target="#b1">[2]</ref>. After multiple evaluations, SuperBit proved to be better suited for the described task. This algorithm approximates cosine similarity between real-valued vectors and groups them into a given number of clusters. The reasoning behind choosing this family of algorithms is twofold: these algorithms have a reputation for being well suited to clustering, and we wanted to test the trade-off between their speed and their effectiveness.</p><p>One of the main challenges of Author Clustering lies in establishing an optimal number of clusters, since the number of clusters is not given a priori. Multiple solutions to this problem exist. Our final algorithm uses a process called silhouetting <ref type="bibr" target="#b10">[11]</ref>.</p></div>
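<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the idea of hashing real-valued vectors by cosine similarity concrete, the following is a minimal sketch of sign-random-projection hashing in which the random hyperplanes are orthogonalized in batches, which is our understanding of the core idea behind SuperBit. It is an illustration under stated assumptions, not the implementation used in the submission: the function names, the batch size, the number of bits, and the seed are all hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math
import random


def orthogonalize(vectors):
    """Gram-Schmidt: turn a batch of random vectors into an orthonormal batch."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            dot = sum(x * y for x, y in zip(w, b))
            w = [x - dot * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis


def make_hyperplanes(dim, n_bits, batch_size, seed=0):
    """Draw n_bits Gaussian hyperplanes, orthogonalized in batches.

    Assumption: batch_size <= dim, otherwise a batch cannot be orthogonal.
    """
    rng = random.Random(seed)
    planes = []
    while len(planes) < n_bits:
        size = min(batch_size, n_bits - len(planes))
        batch = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
        planes.extend(orthogonalize(batch))
    return planes


def signature(vector, planes):
    """One bit per hyperplane: the side of the plane on which the vector lies."""
    return tuple(
        int(sum(x * p for x, p in zip(vector, plane)) >= 0.0) for plane in planes
    )


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    differing = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * differing / len(sig_a))
```
]]></code></p>
<p>Documents whose signatures agree (or agree on a chosen prefix) can then be placed in the same bucket, and the fraction of differing bits gives a cheap estimate of the angle between two document vectors.</p></div>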
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Stylometric Measures</head><p>Due to the lack of language-dependent resources such as Word2Vec and WordNet for languages other than English, we decided to use well-known language-agnostic stylometric measures <ref type="bibr" target="#b4">[5]</ref> as well as a typical bag of word n-grams representation. For the same reason, no stemming or lemmatization is performed on the documents.</p><p>Each document is represented as a fixed-size, real-valued vector. The first part of the vector is a bag of word 3-grams, where each coordinate corresponds to a unique word 3-gram present in the whole document collection for the given problem.</p><p>The rest of the vector is a mixture of lexical word-based and character-based measures. During the research, multiple measures were evaluated; in the end, we decided to use special character frequency, average word length, average sentence length in characters, average sentence length in words, and vocabulary richness (the number of unique words divided by the number of words).</p><p>Our solution to author clustering and authorship link ranking can be summarized in the following steps. First, we approximate the desired number of clusters using silhouetting. Then we represent every document in a collection as a real-valued vector consisting of a bag of word 3-grams and multiple stylometric measures. Next, the SuperBit LSH algorithm is used for the actual clustering procedure. Finally, the authorship link score is calculated using cosine similarity.</p></div>
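<div xmlns="http://www.tei-c.org/ns/1.0"><p>The document representation described above can be sketched as follows. The paper does not specify tokenization or sentence splitting, so this sketch assumes that words are runs of word characters, that sentences end at ".", "!" or "?", and that a special character is any non-alphanumeric, non-whitespace character; case is left untouched. The function names are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import re
from collections import Counter


def stylometric_features(text):
    """Return the five language-agnostic measures as a list of floats."""
    words = re.findall(r"\w+", text)
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    non_space = [c for c in text if not c.isspace()]
    special = [c for c in non_space if not c.isalnum()]

    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    words_per_sentence = [len(re.findall(r"\w+", s)) for s in sentences]
    return [
        len(special) / max(len(non_space), 1),          # special character frequency
        sum(len(w) for w in words) / n_words,           # average word length
        sum(len(s) for s in sentences) / n_sentences,   # average sentence length (characters)
        sum(words_per_sentence) / n_sentences,          # average sentence length (words)
        len(set(words)) / n_words,                      # vocabulary richness
    ]


def word_trigrams(text):
    """Count word 3-grams in a text (no stemming or lemmatization)."""
    words = re.findall(r"\w+", text)
    return Counter(zip(words, words[1:], words[2:]))


def document_vectors(texts):
    """Bag of word 3-grams over the whole collection, followed by the measures."""
    grams = [word_trigrams(t) for t in texts]
    vocabulary = sorted(set().union(*grams))
    return [
        [float(g[t]) for t in vocabulary] + stylometric_features(text)
        for g, text in zip(grams, texts)
    ]
```
]]></code></p>
<p>Because the 3-gram vocabulary is built per collection, all vectors within one clustering problem share the same length, which is what the cosine-based hashing step requires.</p></div>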
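<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Results</head><p>The results of our author clustering approach on the PAN2017 training and test datasets are reported in Table 1 and Table 2, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Style Breach Detection</head></div>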
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Introduction</head><p>The Style Breach Detection task consists of detecting borders where authorship may change within a document. Unlike the text segmentation problem, which mainly focuses on finding topic switches, style breach detection aims to discover borders using writing style features while ignoring the content of the text. To determine the borders where the style changes within a document, we propose a statistical approach based on tf-idf features that characterize documents from widely different points of view: word n-grams (we consider only n = 1 and n = 3), punctuation, part of speech (PoS) obtained with the Penn Treebank POS tagger <ref type="bibr" target="#b11">[12]</ref>, and stopwords.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">The Wilcoxon Signed Rank Test</head><p>The paired samples Wilcoxon signed-rank test is a nonparametric test used to verify the null hypothesis that two samples come from the same distribution <ref type="bibr" target="#b0">[1]</ref>.</p><p>Suppose we have a random sample of N pairs (X_1, Y_1), ..., (X_N, Y_N), where X_1, ..., X_N and Y_1, ..., Y_N correspond to the effect on the blocks/objects before and after some activity, respectively. For each pair, the difference is formed as D_i = X_i − Y_i. We assume that the observations D_1, ..., D_N are independent and come from a population which is continuous and symmetric with median M_D. We verify the null hypothesis H_0: M_D = 0 against the two-sided alternative H_1: M_D ≠ 0.</p><p>The test statistic is determined as follows: we order the absolute differences |D_1|, ..., |D_N| from the smallest to the largest and assign them N integer ranks (from 1 to N), noting the original signs of the differences D_i. We take the sum of the ranks of the positive differences as the test criterion, because the sum of all the ranks is a constant. If r denotes the rank of a random variable, then the test statistic can be written as</p><formula xml:id="formula_0">T = \sum_{i=1}^{N} r(|D_i|) \, I(D_i &gt; 0),<label>(1)</label></formula><p>where I(ρ) = 1 if a statement ρ is true and I(ρ) = 0 otherwise. We denote Z_i = I(D_i &gt; 0) for each i = 1, ..., N. Under the null hypothesis, the Z_i are independent and identically distributed Bernoulli variables with probability P(Z_i = 1) = 1/2. The test statistic is a linear combination of the Z_i variables, so its expected value and variance are:</p><formula xml:id="formula_1">E(T) = \frac{N(N + 1)}{4},<label>(2)</label></formula><formula xml:id="formula_2">\mathrm{Var}(T) = \frac{N(N + 1)(2N + 1)}{24}.<label>(3)</label></formula><p>As we do not rely on the exact distribution of this statistic, we apply an approximation based on the asymptotic normality of T. The following statistic:</p><formula xml:id="formula_4">T^{*} = \frac{T - E(T)}{\sqrt{\mathrm{Var}(T)}}<label>(4)</label></formula><p>is asymptotically normal under H_0. Let α denote the accepted significance level. We reject the null hypothesis in favor of the two-sided alternative if |T*| ≥ z_{1−α/2}, where z_{1−α/2} is the (1 − α/2)-th quantile of the normal distribution with mean 0 and standard deviation 1.</p></div>
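<div xmlns="http://www.tei-c.org/ns/1.0"><p>The computation of the statistic and its normal approximation can be sketched as below. This is an illustrative sketch rather than the code used in the submission. Two choices are assumptions that the text does not state: pairs with a zero difference are dropped, and tied absolute differences share their average rank. The function name is hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (T, T_star, p_value) for paired samples x and y."""
    # Assumption: pairs with a zero difference carry no sign and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0

    # Rank |D_i| from smallest to largest.
    # Assumption: tied absolute differences share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    t = sum(r for r, d in zip(ranks, diffs) if d > 0)     # Eq. (1)
    mean = n * (n + 1) / 4.0                              # Eq. (2)
    variance = n * (n + 1) * (2 * n + 1) / 24.0           # Eq. (3)
    t_star = (t - mean) / math.sqrt(variance)             # Eq. (4)
    # Two-sided p-value from the standard normal tail.
    p_value = math.erfc(abs(t_star) / math.sqrt(2.0))
    return t, t_star, p_value
```
]]></code></p>
<p>Rejecting H_0 when the returned p-value is at most α is equivalent to the quantile rule stated above.</p></div>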
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Tf-idf: Term frequency-inverse document frequency</head><p>Originally, tf-idf assigns a value to each word in a document that grows with the frequency of the word in that document and shrinks with the percentage of documents in which the word appears <ref type="bibr" target="#b9">[10]</ref>.</p><p>Formally, tf-idf is the product of term frequency and inverse document frequency. The term frequency is the relative frequency of the i-th word in the j-th document, and it may be written as</p><formula xml:id="formula_5">\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}},<label>(5)</label></formula><p>where n_{i,j} is the number of occurrences of the i-th word in the j-th document and the denominator is the total number of occurrences of all words in the j-th document. The inverse document frequency is the logarithm of the inverse of the fraction of documents that contain the i-th word:</p><formula xml:id="formula_6">\mathrm{idf}_{i} = \log \frac{|D|}{|\{d : w_i \in d\}|},<label>(6)</label></formula><p>where |D| is the number of all documents in the given corpus and the denominator is equal to the number of documents in which the i-th word occurs at least once. Then, tf-idf for the i-th word and the j-th document is:</p><formula xml:id="formula_8">\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_{i}.<label>(7)</label></formula></div>
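<div xmlns="http://www.tei-c.org/ns/1.0"><p>A direct transcription of Equations (5) to (7) into code may help to check the definitions. The sketch below assumes that each document is already tokenized into a list of terms and uses the natural logarithm, since the base is not specified in the text; the function name is hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: tf-idf} dict per document."""
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        weights.append({
            # tf (Eq. 5) times idf (Eq. 6) gives tf-idf (Eq. 7).
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```
]]></code></p>
<p>Note that a term occurring in every document receives weight zero, because its inverse document frequency is log 1 = 0.</p></div>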
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">The paired samples Wilcoxon Signed Rank test with tf-idf features to detect style breaches</head><p>The corpus used to develop our approach consists only of documents in English, each of which may contain zero or more style breaches; breaches occur at the ends of sentences. Further, we noticed that paragraphs are natural borders of style breaches. On this account, we split each document into sections, assuming that at least two blank lines mark the boundary between two paragraphs. If there are no blank lines within a document, then every m sentences are grouped into a section, where m is a fixed natural number. Customarily, tf-idf is a numerical statistic intended to reflect how important a word is to a document in a corpus <ref type="bibr" target="#b8">[9]</ref>. In our approach, we use tf-idf to determine how important a particular term is to a paragraph in a document. For each document and each feature type mentioned above, we determine the tf-idf matrix X_i, where X_1, X_2, X_3, X_4, X_5 denote the tf-idf matrices for words, punctuation, PoS, stopwords, and word 3-grams, respectively. The number of rows of X_i is equal to the number of paragraphs in the document, and the number of columns is equal to the number of unique terms in this document.</p><p>We represent each paragraph as the concatenation of the tf-idf vectors of the selected feature types, which may be written as:</p><formula xml:id="formula_9">x_k = (x_{k,j_1}, \ldots, x_{k,j_s}), \quad (j_1, \ldots, j_s) \subset \{1, \ldots, 5\},<label>(8)</label></formula><p>where x_k is the combined tf-idf vector for the k-th paragraph, formed by concatenating the s tf-idf vectors of the selected feature types (x_{k,j} is the tf-idf vector of the j-th feature type for the k-th paragraph). The primary aim of this approach is to test whether two consecutive paragraphs were written by one author or by multiple authors. For this purpose, we use the paired samples Wilcoxon signed-rank test, which verifies whether two samples come from the same distribution. We assume that if the same author wrote two paragraphs, they should have the same distribution and, analogously, if two paragraphs were not written by the same author, they come from different distributions. In other words, if the same author drafted two sections, the result of the test should not be statistically significant (the null hypothesis is not rejected; the style does not change between two consecutive paragraphs). On the other hand, if different authors wrote the two paragraphs, then the null hypothesis should be rejected (the style difference between the two sections is statistically significant).</p><p>For each pair of consecutive paragraphs in a document, we test whether these paragraphs have the same style, and we record the resulting p-values. Next, we sort the p-values from smallest to largest and select the S lowest p-values, where S is defined as:</p><formula xml:id="formula_10">S = p \cdot |P| + 1,<label>(9)</label></formula><p>where p is a fixed value in [0, 1] and |P| is the number of paragraphs in a document.</p><p>The borders between paragraphs corresponding to the selected p-values are reported as style breaches.</p></div>
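<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sectioning rule and the selection of the S lowest p-values can be sketched as follows. This is a sketch under assumptions, not the submitted implementation: the sentence splitter is a simple punctuation-based heuristic, the product in Equation (9) is rounded down so that S is an integer, and S is capped at the number of available borders. The function names are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
import math
import re


def split_sections(text, m=10):
    """Split on two or more blank lines; otherwise group every m sentences."""
    parts = [s.strip() for s in re.split(r"\n(?:[ \t]*\n){2,}", text) if s.strip()]
    if len(parts) > 1:
        return parts
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [" ".join(sentences[i:i + m]) for i in range(0, len(sentences), m)]


def select_breaches(p_values, n_paragraphs, p=0.3):
    """p_values[i] is the test result for the border between paragraphs i and i + 1.

    Returns the indices of the borders reported as style breaches.
    """
    s = math.floor(p * n_paragraphs) + 1    # assumption: Eq. (9) rounded down
    s = min(s, len(p_values))               # cannot report more borders than exist
    ranked = sorted(range(len(p_values)), key=lambda i: p_values[i])
    return sorted(ranked[:s])
```
]]></code></p>
<p>For example, with five paragraphs and p = 0.3 this sketch selects the two borders with the lowest p-values, regardless of whether those p-values fall below a conventional significance level.</p></div>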
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5">Evaluations and Results</head><p>The main goal of the evaluations on the training data was to choose the parameter values used in our submitted solution. Bearing in mind the earlier PAN task on Intrinsic Plagiarism Detection <ref type="bibr" target="#b7">[8]</ref>, we assumed that at least 70% of each document was written by one primary author, while the remaining 30% of the text could have been written by other authors. Hence we fixed p at 0.3. Additionally, our initial experiments showed that the best results were obtained for m = 10. Therefore, we performed the principal evaluation, which determines the optimal set of tf-idf features, with these parameter values. Table <ref type="table" target="#tab_7">3</ref> shows the detailed results for each subset of tf-idf features. It is worth noting that our primary intention was to optimize the F-score of WinPR. Because the results obtained on the training dataset were similar, we selected a subset of tf-idf features that, in our previous experience, also gives good results on other datasets. For the final submission, we chose the tf-idf of words, PoS, and stopwords.</p><p>Table <ref type="table" target="#tab_8">4</ref> shows the official results <ref type="bibr" target="#b5">[6]</ref>. Our submitted solution took first place according to winF, winR, and runtime. The proposed approach optimizes recall at the expense of precision and windowDiff, which was the main intention of our system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Conclusion</head><p>We have presented methods for the author identification task <ref type="bibr" target="#b12">[13]</ref> that we submitted to the 2017 PAN competition <ref type="bibr" target="#b6">[7]</ref>. This year the author identification task was divided into the author clustering and style breach detection tasks. We proposed solutions for these tasks independently.</p><p>The submitted system for the style breach detection task obtained the best result according to the F-score of WinPR, which is used for the final ranking of all participating teams. Additionally, it is worth noting that we built both of our algorithms with execution time in mind. Both systems had the shortest runtimes of all submitted solutions. Our implementation for the author clustering task achieved the fastest running time, which could be further improved if the number of clusters were known a priori for each problem, since optimizing the number of clusters for each problem is the most time-consuming step of the algorithm. Despite its short running time, our algorithm did not perform substantially worse than those of other contestants. For the use cases in which we intend to employ this algorithm, the trade-off between running time and performance proved to be satisfactory, which means we may use it in real-world scenarios after a few improvements, such as using language-specific tools like WordNet.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Results for PAN2017 training dataset</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre F-Bcubed R-Bcubed P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem001</cell><cell>en</cell><cell>articles 0.407890 0.344440 0.500000</cell><cell>0.032542</cell></row><row><cell>problem002</cell><cell>en</cell><cell>articles 0.383370 0.436670 0.341670</cell><cell>0.020267</cell></row><row><cell>problem003</cell><cell>en</cell><cell>articles 0.441710 0.354550 0.585710</cell><cell>0.031208</cell></row><row><cell>problem004</cell><cell>en</cell><cell>articles 0.494250 0.620000 0.410910</cell><cell>0.070715</cell></row><row><cell>problem005</cell><cell>en</cell><cell>articles 0.333330 1.000000 0.200000</cell><cell>0.127880</cell></row><row><cell>problem006</cell><cell>en</cell><cell>articles 0.600000 0.866670 0.458820</cell><cell>0.277360</cell></row><row><cell>problem007</cell><cell>en</cell><cell>articles 0.393570 1.000000 0.245000</cell><cell>0.235450</cell></row><row><cell>problem008</cell><cell>en</cell><cell>articles 0.731530 0.661110 0.818750</cell><cell>0.485970</cell></row><row><cell>problem009</cell><cell>en</cell><cell>articles 0.389530 0.363890 0.419050</cell><cell>0.023356</cell></row><row><cell>problem010</cell><cell>en</cell><cell>articles 0.428910 0.319050 0.654170</cell><cell>0.105910</cell></row><row><cell>problem011</cell><cell>en</cell><cell>reviews 0.473870 0.421150 0.541670</cell><cell>0.114890</cell></row><row><cell>problem012</cell><cell>en</cell><cell>reviews 0.677000 0.753330 0.614710</cell><cell>0.346780</cell></row><row><cell>problem013</cell><cell>en</cell><cell>reviews 0.473630 0.853330 0.327780</cell><cell>0.170070</cell></row><row><cell>problem014</cell><cell>en</cell><cell>reviews 0.405570 0.366670 0.453700</cell><cell>0.043251</cell></row><row><cell>problem015</cell><cell>en</cell><cell>reviews 0.509020 0.658930 0.414680</cell><cell>0.168070</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1 -</head><label>1</label><figDesc>Results for PAN2017 training dataset (continued)</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre F-Bcubed R-Bcubed P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem016</cell><cell>en</cell><cell>reviews 0.405020 0.600480 0.305560</cell><cell>0.142210</cell></row><row><cell>problem017</cell><cell>en</cell><cell>reviews 0.408400 0.443330 0.378570</cell><cell>0.065487</cell></row><row><cell>problem018</cell><cell>en</cell><cell>reviews 0.554640 0.493330 0.633330</cell><cell>0.028054</cell></row><row><cell>problem019</cell><cell>en</cell><cell>reviews 0.375870 0.789290 0.246670</cell><cell>0.179890</cell></row><row><cell>problem020</cell><cell>en</cell><cell>reviews 0.353110 0.820000 0.225000</cell><cell>0.070972</cell></row><row><cell>problem021</cell><cell>nl</cell><cell>articles 0.495550 0.497780 0.493330</cell><cell>0.063403</cell></row><row><cell>problem022</cell><cell>nl</cell><cell>articles 0.461920 0.387140 0.572500</cell><cell>0.094984</cell></row><row><cell>problem023</cell><cell>nl</cell><cell>articles 0.400250 0.735000 0.275000</cell><cell>0.073250</cell></row><row><cell>problem024</cell><cell>nl</cell><cell>articles 0.515130 0.518180 0.512120</cell><cell>0.219490</cell></row><row><cell>problem025</cell><cell>nl</cell><cell>articles 0.524570 0.733330 0.408330</cell><cell>0.125440</cell></row><row><cell>problem026</cell><cell>nl</cell><cell>articles 0.559890 0.446670 0.750000</cell><cell>0.170080</cell></row><row><cell>problem027</cell><cell>nl</cell><cell>articles 0.360600 0.457140 0.297730</cell><cell>0.042885</cell></row><row><cell>problem028</cell><cell>nl</cell><cell>articles 0.429240 0.420000 0.438890</cell><cell>0.032622</cell></row><row><cell>problem029</cell><cell>nl</cell><cell>articles 0.598770 0.746150 0.500000</cell><cell>0.273150</cell></row><row><cell>problem030</cell><cell>nl</cell><cell>articles 0.504400 0.426190 
0.617780</cell><cell>0.147790</cell></row><row><cell>problem031</cell><cell>nl</cell><cell>reviews 0.497900 0.781250 0.365380</cell><cell>0.252900</cell></row><row><cell>problem032</cell><cell>nl</cell><cell>reviews 0.523900 0.468750 0.593750</cell><cell>0.078873</cell></row><row><cell>problem033</cell><cell>nl</cell><cell>reviews 0.412700 0.361110 0.481480</cell><cell>0.002976</cell></row><row><cell>problem034</cell><cell>nl</cell><cell>reviews 0.515000 0.678570 0.414970</cell><cell>0.178020</cell></row><row><cell>problem035</cell><cell>nl</cell><cell>reviews 0.474580 0.400000 0.583330</cell><cell>0.132480</cell></row><row><cell>problem036</cell><cell>nl</cell><cell>reviews 0.469260 0.416670 0.537040</cell><cell>0.004902</cell></row><row><cell>problem037</cell><cell>nl</cell><cell>reviews 0.322500 0.600000 0.220510</cell><cell>0.151300</cell></row><row><cell>problem038</cell><cell>nl</cell><cell>reviews 0.535290 0.433330 0.700000</cell><cell>0.028499</cell></row><row><cell>problem039</cell><cell>nl</cell><cell>reviews 0.463160 0.400000 0.550000</cell><cell>0.000000</cell></row><row><cell>problem040</cell><cell>nl</cell><cell>reviews 0.432780 0.683330 0.316670</cell><cell>0.196850</cell></row><row><cell>problem041</cell><cell>gr</cell><cell>articles 0.425240 0.636670 0.319230</cell><cell>0.090813</cell></row><row><cell>problem042</cell><cell>gr</cell><cell>articles 0.478660 0.595830 0.400000</cell><cell>0.131320</cell></row><row><cell>problem043</cell><cell>gr</cell><cell>articles 0.520610 0.761670 0.395450</cell><cell>0.163680</cell></row><row><cell>problem044</cell><cell>gr</cell><cell>articles 0.493880 0.728330 0.373610</cell><cell>0.197920</cell></row><row><cell>problem045</cell><cell>gr</cell><cell>articles 0.415200 0.520000 0.345560</cell><cell>0.042738</cell></row><row><cell>problem046</cell><cell>gr</cell><cell>articles 0.519860 0.700000 0.413460</cell><cell>0.171660</cell></row><row><cell>problem047</cell><cell>gr</cell><cell>articles 0.453640 0.691670 
0.337500</cell><cell>0.163980</cell></row><row><cell>problem048</cell><cell>gr</cell><cell>articles 0.479610 0.660000 0.376670</cell><cell>0.102200</cell></row><row><cell>problem049</cell><cell>gr</cell><cell>articles 0.470300 0.500000 0.443940</cell><cell>0.108860</cell></row><row><cell>problem050</cell><cell>gr</cell><cell>articles 0.449520 0.383330 0.543330</cell><cell>0.131710</cell></row><row><cell>problem051</cell><cell>gr</cell><cell>reviews 0.480540 0.420830 0.560000</cell><cell>0.055130</cell></row><row><cell>problem052</cell><cell>gr</cell><cell>reviews 0.393060 0.636670 0.284290</cell><cell>0.093994</cell></row><row><cell>problem053</cell><cell>gr</cell><cell>reviews 0.534860 0.567500 0.505770</cell><cell>0.182710</cell></row><row><cell>problem054</cell><cell>gr</cell><cell>reviews 0.459390 0.551110 0.393850</cell><cell>0.105250</cell></row><row><cell>problem055</cell><cell>gr</cell><cell>reviews 0.509330 0.916670 0.352630</cell><cell>0.237980</cell></row><row><cell>problem056</cell><cell>gr</cell><cell>reviews 0.394480 0.593330 0.295450</cell><cell>0.042487</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 -</head><label>1</label><figDesc>Results for PAN2017 training dataset (continued)</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre F-Bcubed R-Bcubed P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem057</cell><cell>gr</cell><cell>reviews 0.365170 0.596670 0.263100</cell><cell>0.038210</cell></row><row><cell>problem058</cell><cell>gr</cell><cell>reviews 0.461150 0.437500 0.487500</cell><cell>0.063835</cell></row><row><cell>problem059</cell><cell>gr</cell><cell>reviews 0.515050 0.745830 0.393330</cell><cell>0.109060</cell></row><row><cell>problem060</cell><cell>gr</cell><cell>reviews 0.483030 0.630000 0.391670</cell><cell>0.044900</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>Results for PAN2017 test dataset</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre F-Bcubed R-Bcubed P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem001</cell><cell>en</cell><cell>articles 0.645930 0.696670 0.602080</cell><cell>0.400580</cell></row><row><cell>problem002</cell><cell>en</cell><cell>articles 0.463950 0.383330 0.587500</cell><cell>0.081134</cell></row><row><cell>problem003</cell><cell>en</cell><cell>articles 0.418680 0.461900 0.382860</cell><cell>0.124740</cell></row><row><cell>problem004</cell><cell>en</cell><cell>articles 0.412690 0.543330 0.332690</cell><cell>0.083299</cell></row><row><cell>problem005</cell><cell>en</cell><cell>articles 0.628290 0.623330 0.633330</cell><cell>0.282090</cell></row><row><cell>problem006</cell><cell>en</cell><cell>articles 0.418510 0.398330 0.440830</cell><cell>0.060129</cell></row><row><cell>problem007</cell><cell>en</cell><cell>articles 0.423770 0.348720 0.540000</cell><cell>0.072016</cell></row><row><cell>problem008</cell><cell>en</cell><cell>articles 0.482420 0.460000 0.507140</cell><cell>0.079461</cell></row><row><cell>problem009</cell><cell>en</cell><cell>articles 0.776280 0.738890 0.817650</cell><cell>0.474400</cell></row><row><cell>problem010</cell><cell>en</cell><cell>articles 0.572720 0.516670 0.642420</cell><cell>0.165370</cell></row><row><cell>problem011</cell><cell>en</cell><cell>articles 0.462030 0.424290 0.507140</cell><cell>0.014544</cell></row><row><cell>problem012</cell><cell>en</cell><cell>articles 0.528660 0.575000 0.489230</cell><cell>0.123790</cell></row><row><cell>problem013</cell><cell>en</cell><cell>articles 0.450820 0.644440 0.346670</cell><cell>0.092703</cell></row><row><cell>problem014</cell><cell>en</cell><cell>articles 0.621250 0.633330 0.609620</cell><cell>0.205350</cell></row><row><cell>problem015</cell><cell>en</cell><cell>articles 0.424140 0.552380 
0.344230</cell><cell>0.027974</cell></row><row><cell>problem016</cell><cell>en</cell><cell>articles 0.479660 0.658330 0.377270</cell><cell>0.154390</cell></row><row><cell>problem017</cell><cell>en</cell><cell>articles 0.487220 0.458330 0.520000</cell><cell>0.029075</cell></row><row><cell>problem018</cell><cell>en</cell><cell>articles 0.520000 0.433330 0.650000</cell><cell>0.022727</cell></row><row><cell>problem019</cell><cell>en</cell><cell>articles 0.446230 0.543330 0.378570</cell><cell>0.072511</cell></row><row><cell>problem020</cell><cell>en</cell><cell>articles 0.490040 0.485710 0.494440</cell><cell>0.100070</cell></row><row><cell>problem021</cell><cell>en</cell><cell>reviews 0.345450 0.950000 0.211110</cell><cell>0.221300</cell></row><row><cell>problem022</cell><cell>en</cell><cell>reviews 0.350800 0.512500 0.266670</cell><cell>0.066592</cell></row><row><cell>problem023</cell><cell>en</cell><cell>reviews 0.353910 1.000000 0.215000</cell><cell>0.272140</cell></row><row><cell>problem024</cell><cell>en</cell><cell>reviews 0.400190 0.600830 0.300000</cell><cell>0.116170</cell></row><row><cell>problem025</cell><cell>en</cell><cell>reviews 0.337180 0.875000 0.208820</cell><cell>0.065738</cell></row><row><cell>problem026</cell><cell>en</cell><cell>reviews 0.469780 0.508330 0.436670</cell><cell>0.034419</cell></row><row><cell>problem027</cell><cell>en</cell><cell>reviews 0.402840 0.522220 0.327880</cell><cell>0.053262</cell></row><row><cell>problem028</cell><cell>en</cell><cell>reviews 0.494430 0.600000 0.420450</cell><cell>0.040009</cell></row><row><cell>problem029</cell><cell>en</cell><cell>reviews 0.501390 0.720000 0.384620</cell><cell>0.061785</cell></row><row><cell>problem030</cell><cell>en</cell><cell>reviews 0.380680 0.860000 0.244440</cell><cell>0.078936</cell></row><row><cell>problem031</cell><cell>en</cell><cell>reviews 0.321360 0.218570 0.606670</cell><cell>0.021624</cell></row><row><cell>problem032</cell><cell>en</cell><cell>reviews 0.492580 0.673330 
0.388330</cell><cell>0.227330</cell></row><row><cell>problem033</cell><cell>en</cell><cell>reviews 0.516360 0.708330 0.406250</cell><cell>0.153760</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 2 -</head><label>2</label><figDesc>Results for PAN2017 test dataset (continued)</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre F-Bcubed R-Bcubed P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem034</cell><cell>en</cell><cell>reviews 0.330710 0.230770 0.583330</cell><cell>0.011269</cell></row><row><cell>problem035</cell><cell>en</cell><cell>reviews 0.567830 0.578330 0.557690</cell><cell>0.187770</cell></row><row><cell>problem036</cell><cell>en</cell><cell>reviews 0.487760 0.697500 0.375000</cell><cell>0.114990</cell></row><row><cell>problem037</cell><cell>en</cell><cell>reviews 0.384690 0.415240 0.358330</cell><cell>0.033317</cell></row><row><cell>problem038</cell><cell>en</cell><cell>reviews 0.431790 0.369440 0.519440</cell><cell>0.051175</cell></row><row><cell>problem039</cell><cell>en</cell><cell>reviews 0.512680 0.407690 0.690480</cell><cell>0.070637</cell></row><row><cell>problem040</cell><cell>en</cell><cell>reviews 0.470650 0.834290 0.327780</cell><cell>0.218080</cell></row><row><cell>problem041</cell><cell>nl</cell><cell>articles 0.368420 0.233330 0.875000</cell><cell>0.136820</cell></row><row><cell>problem042</cell><cell>nl</cell><cell>articles 0.406060 0.574170 0.314100</cell><cell>0.190250</cell></row><row><cell>problem043</cell><cell>nl</cell><cell>articles 0.486960 0.700000 0.373330</cell><cell>0.270560</cell></row><row><cell>problem044</cell><cell>nl</cell><cell>articles 0.398870 0.371110 0.431110</cell><cell>0.063250</cell></row><row><cell>problem045</cell><cell>nl</cell><cell>articles 0.636000 0.750000 0.552080</cell><cell>0.313590</cell></row><row><cell>problem046</cell><cell>nl</cell><cell>articles 0.465100 0.390480 0.575000</cell><cell>0.074737</cell></row><row><cell>problem047</cell><cell>nl</cell><cell>articles 0.387770 0.320830 0.490000</cell><cell>0.020058</cell></row><row><cell>problem048</cell><cell>nl</cell><cell>articles 0.461540 1.000000 
0.300000</cell><cell>0.369360</cell></row><row><cell>problem049</cell><cell>nl</cell><cell>articles 0.498750 0.525000 0.475000</cell><cell>0.105010</cell></row><row><cell>problem050</cell><cell>nl</cell><cell>articles 0.468940 0.342860 0.741670</cell><cell>0.059018</cell></row><row><cell>problem051</cell><cell>nl</cell><cell>articles 0.405010 0.397780 0.412500</cell><cell>0.102280</cell></row><row><cell>problem052</cell><cell>nl</cell><cell>articles 0.427850 0.466670 0.395000</cell><cell>0.018594</cell></row><row><cell>problem053</cell><cell>nl</cell><cell>articles 0.548000 0.694440 0.452560</cell><cell>0.228040</cell></row><row><cell>problem054</cell><cell>nl</cell><cell>articles 0.517640 0.637500 0.435710</cell><cell>0.116780</cell></row><row><cell>problem055</cell><cell>nl</cell><cell>articles 0.439160 0.563890 0.359620</cell><cell>0.046793</cell></row><row><cell>problem056</cell><cell>nl</cell><cell>articles 0.421090 0.440480 0.403330</cell><cell>0.062252</cell></row><row><cell>problem057</cell><cell>nl</cell><cell>articles 0.561150 1.000000 0.390000</cell><cell>0.332110</cell></row><row><cell>problem058</cell><cell>nl</cell><cell>articles 0.473750 0.620000 0.383330</cell><cell>0.162460</cell></row><row><cell>problem059</cell><cell>nl</cell><cell>articles 0.486730 0.533330 0.447620</cell><cell>0.042951</cell></row><row><cell>problem060</cell><cell>nl</cell><cell>articles 0.368980 0.415000 0.332140</cell><cell>0.050216</cell></row><row><cell>problem061</cell><cell>nl</cell><cell>reviews 0.498180 0.875000 0.348210</cell><cell>0.241950</cell></row><row><cell>problem062</cell><cell>nl</cell><cell>reviews 0.444350 0.712960 0.322750</cell><cell>0.134440</cell></row><row><cell>problem063</cell><cell>nl</cell><cell>reviews 0.377590 0.500000 0.303330</cell><cell>0.167750</cell></row><row><cell>problem064</cell><cell>nl</cell><cell>reviews 0.411470 0.468750 0.366670</cell><cell>0.042372</cell></row><row><cell>problem065</cell><cell>nl</cell><cell>reviews 0.443180 
0.375000 0.541670</cell><cell>0.015341</cell></row><row><cell>problem066</cell><cell>nl</cell><cell>reviews 0.418950 0.593750 0.323660</cell><cell>0.157740</cell></row><row><cell>problem067</cell><cell>nl</cell><cell>reviews 0.533770 0.453700 0.648150</cell><cell>0.084187</cell></row><row><cell>problem068</cell><cell>nl</cell><cell>reviews 0.402930 0.333330 0.509260</cell><cell>0.017273</cell></row><row><cell>problem069</cell><cell>nl</cell><cell>reviews 0.466600 0.472220 0.461110</cell><cell>0.074354</cell></row><row><cell>problem070</cell><cell>nl</cell><cell>reviews 0.506730 0.531250 0.484380</cell><cell>0.073393</cell></row><row><cell>problem071</cell><cell>nl</cell><cell>reviews 0.517940 0.540000 0.497620</cell><cell>0.042484</cell></row><row><cell>problem072</cell><cell>nl</cell><cell>reviews 0.549850 0.583330 0.520000</cell><cell>0.095238</cell></row><row><cell>problem073</cell><cell>nl</cell><cell>reviews 0.545450 0.500000 0.600000</cell><cell>0.071429</cell></row><row><cell>problem074</cell><cell>nl</cell><cell>reviews 0.444440 0.400000 0.500000</cell><cell>0.000000</cell></row><row><cell></cell><cell></cell><cell cols="2">Continued on next page</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 2 -</head><label>2</label><figDesc>Results for PAN2017 test dataset</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre</cell><cell>F-Bcubed</cell><cell>R-Bcubed</cell><cell>P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem075</cell><cell>nl</cell><cell>reviews</cell><cell>0.515460</cell><cell>0.550000</cell><cell>0.485000</cell><cell>0.057313</cell></row><row><cell>problem076</cell><cell>nl</cell><cell>reviews</cell><cell>0.497810</cell><cell>0.775000</cell><cell>0.366670</cell><cell>0.109060</cell></row><row><cell>problem077</cell><cell>nl</cell><cell>reviews</cell><cell>0.574230</cell><cell>0.650000</cell><cell>0.514290</cell><cell>0.054895</cell></row><row><cell>problem078</cell><cell>nl</cell><cell>reviews</cell><cell>0.445810</cell><cell>0.475000</cell><cell>0.420000</cell><cell>0.004762</cell></row><row><cell>problem079</cell><cell>nl</cell><cell>reviews</cell><cell>0.294550</cell><cell>0.858330</cell><cell>0.177780</cell><cell>0.122440</cell></row><row><cell>problem080</cell><cell>nl</cell><cell>reviews</cell><cell>0.582650</cell><cell>0.483330</cell><cell>0.733330</cell><cell>0.010417</cell></row><row><cell>problem081</cell><cell>gr</cell><cell>articles</cell><cell>0.500940</cell><cell>0.900000</cell><cell>0.347060</cell><cell>0.183760</cell></row><row><cell>problem082</cell><cell>gr</cell><cell>articles</cell><cell>0.479490</cell><cell>0.425000</cell><cell>0.550000</cell><cell>0.051942</cell></row><row><cell>problem083</cell><cell>gr</cell><cell>articles</cell><cell>0.562320</cell><cell>0.584620</cell><cell>0.541670</cell><cell>0.235550</cell></row><row><cell>problem084</cell><cell>gr</cell><cell>articles</cell><cell>0.454610</cell><cell>0.541670</cell><cell>0.391670</cell><cell>0.080786</cell></row><row><cell>problem085</cell><cell>gr</cell><cell>articles</cell><cell>0.404820</cell><cell>0.491670</cell><cell>0.344050</cell><cell>0.099311</cell></row><row><cell>problem086</cell><cell>gr</cell><cell>articles</cell><cell>0.365330</cell><cell>0.454170</cell><cell>0.305560</cell><cell>0.123180</cell></row><row><cell>problem087</cell><cell>gr</cell><cell>articles</cell><cell>0.317580</cell><cell>0.504760</cell><cell>0.231670</cell><cell>0.035122</cell></row><row><cell>problem088</cell><cell>gr</cell><cell>articles</cell><cell>0.523250</cell><cell>0.710000</cell><cell>0.414290</cell><cell>0.066747</cell></row><row><cell>problem089</cell><cell>gr</cell><cell>articles</cell><cell>0.795180</cell><cell>1.000000</cell><cell>0.660000</cell><cell>0.536570</cell></row><row><cell>problem090</cell><cell>gr</cell><cell>articles</cell><cell>0.662110</cell><cell>0.825000</cell><cell>0.552940</cell><cell>0.338080</cell></row><row><cell>problem091</cell><cell>gr</cell><cell>articles</cell><cell>0.650880</cell><cell>0.620000</cell><cell>0.685000</cell><cell>0.352410</cell></row><row><cell>problem092</cell><cell>gr</cell><cell>articles</cell><cell>0.519040</cell><cell>0.857140</cell><cell>0.372220</cell><cell>0.277550</cell></row><row><cell>problem093</cell><cell>gr</cell><cell>articles</cell><cell>0.544930</cell><cell>0.526320</cell><cell>0.564910</cell><cell>0.111680</cell></row><row><cell>problem094</cell><cell>gr</cell><cell>articles</cell><cell>0.496540</cell><cell>0.610710</cell><cell>0.418330</cell><cell>0.198450</cell></row><row><cell>problem095</cell><cell>gr</cell><cell>articles</cell><cell>0.383130</cell><cell>0.530000</cell><cell>0.300000</cell><cell>0.131270</cell></row><row><cell>problem096</cell><cell>gr</cell><cell>articles</cell><cell>0.407150</cell><cell>0.291730</cell><cell>0.673680</cell><cell>0.035518</cell></row><row><cell>problem097</cell><cell>gr</cell><cell>articles</cell><cell>0.577060</cell><cell>0.444120</cell><cell>0.823610</cell><cell>0.285260</cell></row><row><cell>problem098</cell><cell>gr</cell><cell>articles</cell><cell>0.429490</cell><cell>0.775000</cell><cell>0.297060</cell><cell>0.165900</cell></row><row><cell>problem099</cell><cell>gr</cell><cell>articles</cell><cell>0.457100</cell><cell>0.737140</cell><cell>0.331250</cell><cell>0.166660</cell></row><row><cell>problem100</cell><cell>gr</cell><cell>articles</cell><cell>0.435760</cell><cell>0.441670</cell><cell>0.430000</cell><cell>0.039691</cell></row><row><cell>problem101</cell><cell>gr</cell><cell>reviews</cell><cell>0.365790</cell><cell>0.566900</cell><cell>0.270000</cell><cell>0.143080</cell></row><row><cell>problem102</cell><cell>gr</cell><cell>reviews</cell><cell>0.405040</cell><cell>0.434440</cell><cell>0.379370</cell><cell>0.070181</cell></row><row><cell>problem103</cell><cell>gr</cell><cell>reviews</cell><cell>0.419470</cell><cell>0.733330</cell><cell>0.293750</cell><cell>0.132500</cell></row><row><cell>problem104</cell><cell>gr</cell><cell>reviews</cell><cell>0.495240</cell><cell>0.650000</cell><cell>0.400000</cell><cell>0.154300</cell></row><row><cell>problem105</cell><cell>gr</cell><cell>reviews</cell><cell>0.515040</cell><cell>0.557140</cell><cell>0.478850</cell><cell>0.123990</cell></row><row><cell>problem106</cell><cell>gr</cell><cell>reviews</cell><cell>0.495700</cell><cell>0.708330</cell><cell>0.381250</cell><cell>0.099837</cell></row><row><cell>problem107</cell><cell>gr</cell><cell>reviews</cell><cell>0.485800</cell><cell>0.440380</cell><cell>0.541670</cell><cell>0.145940</cell></row><row><cell>problem108</cell><cell>gr</cell><cell>reviews</cell><cell>0.426770</cell><cell>0.640480</cell><cell>0.320000</cell><cell>0.213520</cell></row><row><cell>problem109</cell><cell>gr</cell><cell>reviews</cell><cell>0.452050</cell><cell>0.361430</cell><cell>0.603330</cell><cell>0.200960</cell></row><row><cell>problem110</cell><cell>gr</cell><cell>reviews</cell><cell>0.377070</cell><cell>0.583330</cell><cell>0.278570</cell><cell>0.155800</cell></row><row><cell>problem111</cell><cell>gr</cell><cell>reviews</cell><cell>0.384740</cell><cell>0.775000</cell><cell>0.255880</cell><cell>0.093850</cell></row><row><cell>problem112</cell><cell>gr</cell><cell>reviews</cell><cell>0.430200</cell><cell>0.500000</cell><cell>0.377500</cell><cell>0.066225</cell></row><row><cell>problem113</cell><cell>gr</cell><cell>reviews</cell><cell>0.397240</cell><cell>0.622620</cell><cell>0.291670</cell><cell>0.181710</cell></row><row><cell>problem114</cell><cell>gr</cell><cell>reviews</cell><cell>0.356770</cell><cell>0.716670</cell><cell>0.237500</cell><cell>0.058730</cell></row><row><cell>problem115</cell><cell>gr</cell><cell>reviews</cell><cell>0.374060</cell><cell>0.808330</cell><cell>0.243330</cell><cell>0.097110</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 2 -</head><label>2</label><figDesc>Results for PAN2017 test dataset</figDesc><table><row><cell>Problem</cell><cell>Language</cell><cell>Genre</cell><cell>F-Bcubed</cell><cell>R-Bcubed</cell><cell>P-Bcubed</cell><cell>Av-Precision</cell></row><row><cell>problem116</cell><cell>gr</cell><cell>reviews</cell><cell>0.430040</cell><cell>0.640000</cell><cell>0.323810</cell><cell>0.119100</cell></row><row><cell>problem117</cell><cell>gr</cell><cell>reviews</cell><cell>0.431020</cell><cell>0.478330</cell><cell>0.392220</cell><cell>0.008253</cell></row><row><cell>problem118</cell><cell>gr</cell><cell>reviews</cell><cell>0.452780</cell><cell>0.737500</cell><cell>0.326670</cell><cell>0.108080</cell></row><row><cell>problem119</cell><cell>gr</cell><cell>reviews</cell><cell>0.420360</cell><cell>0.795000</cell><cell>0.285710</cell><cell>0.085212</cell></row><row><cell>problem120</cell><cell>gr</cell><cell>reviews</cell><cell>0.470340</cell><cell>0.620830</cell><cell>0.378570</cell><cell>0.145050</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 3 .</head><label>3</label><figDesc>Results of the training evaluation for each subset of tf-idf features; m and p are fixed (m = 10 and p = 0.3).</figDesc><table><row><cell>Combined features</cell><cell>WindowDiff</cell><cell>WinP</cell><cell>WinR</cell><cell>WinF</cell></row><row><cell>[X2, X4, X5]</cell><cell>0.526434</cell><cell>0.344312</cell><cell>0.620210</cell><cell>0.342847</cell></row><row><cell>[X4, X5]</cell><cell>0.527448</cell><cell>0.343365</cell><cell>0.619061</cell><cell>0.341818</cell></row><row><cell>[X3, X4, X5]</cell><cell>0.525729</cell><cell>0.343161</cell><cell>0.617870</cell><cell>0.341496</cell></row><row><cell>[X2, X3, X4, X5]</cell><cell>0.526380</cell><cell>0.341384</cell><cell>0.616980</cell><cell>0.340310</cell></row><row><cell>[X1, X4]</cell><cell>0.534279</cell><cell>0.339005</cell><cell>0.617799</cell><cell>0.337278</cell></row><row><cell>[X1, X3, X4]</cell><cell>0.535459</cell><cell>0.336084</cell><cell>0.612633</cell><cell>0.333563</cell></row><row><cell>[X1, X3, X5]</cell><cell>0.532014</cell><cell>0.332278</cell><cell>0.613327</cell><cell>0.333267</cell></row><row><cell>[X1, X5]</cell><cell>0.532141</cell><cell>0.331560</cell><cell>0.614199</cell><cell>0.333213</cell></row><row><cell>[X5]</cell><cell>0.533403</cell><cell>0.333675</cell><cell>0.610578</cell><cell>0.333127</cell></row><row><cell>[X1, X2, X3, X5]</cell><cell>0.532065</cell><cell>0.332111</cell><cell>0.613266</cell><cell>0.333075</cell></row><row><cell>[X1, X2, X4, X5]</cell><cell>0.534239</cell><cell>0.331392</cell><cell>0.613619</cell><cell>0.332709</cell></row><row><cell>[X1, X4, X5]</cell><cell>0.533170</cell><cell>0.331391</cell><cell>0.613619</cell><cell>0.332707</cell></row><row><cell>[X1, X2, X5]</cell><cell>0.532724</cell><cell>0.331013</cell><cell>0.613450</cell><cell>0.332558</cell></row><row><cell>[X2, X3, X5]</cell><cell>0.534358</cell><cell>0.332229</cell><cell>0.610030</cell><cell>0.332168</cell></row><row><cell>[X3, X5]</cell><cell>0.533902</cell><cell>0.332250</cell><cell>0.609848</cell><cell>0.332164</cell></row><row><cell>[X2, X5]</cell><cell>0.534230</cell><cell>0.331857</cell><cell>0.609818</cell><cell>0.331915</cell></row><row><cell>[X1, X2, X3, X4]</cell><cell>0.536221</cell><cell>0.334333</cell><cell>0.610863</cell><cell>0.331759</cell></row><row><cell>[X1, X3]</cell><cell>0.534803</cell><cell>0.326948</cell><cell>0.615239</cell><cell>0.331622</cell></row><row><cell>[X2, X4]</cell><cell>0.537484</cell><cell>0.332465</cell><cell>0.615676</cell><cell>0.331344</cell></row><row><cell>[X1, X3, X4, X5]</cell><cell>0.534146</cell><cell>0.330113</cell><cell>0.611450</cell><cell>0.331102</cell></row><row><cell>[X1, X2, X3, X4, X5]</cell><cell>0.534129</cell><cell>0.330113</cell><cell>0.611450</cell><cell>0.331102</cell></row><row><cell>[X1, X2, X4]</cell><cell>0.538384</cell><cell>0.332572</cell><cell>0.611185</cell><cell>0.330859</cell></row><row><cell>[X3]</cell><cell>0.539429</cell><cell>0.327774</cell><cell>0.608097</cell><cell>0.330372</cell></row><row><cell>[X4]</cell><cell>0.534342</cell><cell>0.331434</cell><cell>0.609035</cell><cell>0.329315</cell></row><row><cell>[X1, X2]</cell><cell>0.541324</cell><cell>0.325407</cell><cell>0.611997</cell><cell>0.328848</cell></row><row><cell>[X1, X2, X3]</cell><cell>0.537711</cell><cell>0.323988</cell><cell>0.612137</cell><cell>0.328599</cell></row><row><cell>[X1]</cell><cell>0.541647</cell><cell>0.321542</cell><cell>0.609613</cell><cell>0.325763</cell></row><row><cell>[X2, X3]</cell><cell>0.540967</cell><cell>0.322548</cell><cell>0.606527</cell><cell>0.325516</cell></row><row><cell>[X2, X3, X4]</cell><cell>0.542909</cell><cell>0.326722</cell><cell>0.605585</cell><cell>0.324760</cell></row><row><cell>[X3, X4]</cell><cell>0.541420</cell><cell>0.326358</cell><cell>0.603533</cell><cell>0.323806</cell></row><row><cell>[X2]</cell><cell>0.561822</cell><cell>0.312071</cell><cell>0.599578</cell><cell>0.315449</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 4 .</head><label>4</label><figDesc>Official results for PAN2017 test dataset</figDesc><table><row><cell>Team</cell><cell>winF</cell><cell>winP</cell><cell>winR</cell><cell>windowDiff</cell><cell>Runtime</cell></row><row><cell>OPI-JSA</cell><cell>0.322601</cell><cell>0.314656</cell><cell>0.585617</cell><cell>0.545648</cell><cell>00:01:19</cell></row><row><cell>khan17</cell><cell>0.288795</cell><cell>0.399004</cell><cell>0.487075</cell><cell>0.479990</cell><cell>00:02:23</cell></row><row><cell>kuznetsova17</cell><cell>0.277264</cell><cell>0.371108</cell><cell>0.542527</cell><cell>0.529496</cell><cell>00:20:25</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Nonparametric statistical inference</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Gibbons</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chakraborti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Encyclopedia of Statistical Science</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Lovric</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="977" to="979" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Super-bit locality-sensitive hashing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Tian</surname></persName>
		</author>
		<ptr target="http://papers.nips.cc/paper/4847-super-bit-locality-sensitive-hashing.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 25</title>
				<editor>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><forename type="middle">J C</forename><surname>Burges</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc.</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="108" to="116" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno>CoRR abs/1301.3781</idno>
		<ptr target="http://arxiv.org/abs/1301.3781" />
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">WordNet: A lexical database for English</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Miller</surname></persName>
		</author>
		<idno type="DOI">10.1145/219717.219748</idno>
		<ptr target="http://doi.acm.org/10.1145/219717.219748" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="issue">11</biblScope>
			<biblScope unit="page" from="39" to="41" />
			<date type="published" when="1995-11">Nov 1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Identification of author personality traits using stylistic features: Notebook for PAN at CLEF 2015</title>
		<author>
			<persName><forename type="first">I</forename><surname>Pervaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ameer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sittar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">M A</forename><surname>Nawab</surname></persName>
		</author>
		<ptr target="http://dblp.uni-trier.de/db/conf/clef/clef2015w.html#PervazASN15" />
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><forename type="middle">J F</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Sanjuan</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="volume">1391</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Berlin Heidelberg New York</pubPlace>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<imprint>
			<publisher>Springer</publisher>
			<pubPlace>Berlin Heidelberg New York</pubPlace>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of the 1st International Competition on Plagiarism Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Eiselt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org/Vol-502" />
	</analytic>
	<monogr>
		<title level="m">SEPLN 09 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09)</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Agirre</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2009-09">Sep 2009</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Mining of Massive Datasets</title>
		<author>
			<persName><forename type="first">A</forename><surname>Rajaraman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Ullman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2011">2011</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ramos</surname></persName>
		</author>
		<title level="m">Using tf-idf to determine word relevance in document queries</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Rousseeuw</surname></persName>
		</author>
		<ptr target="http://www.sciencedirect.com/science/article/pii/0377042787901257" />
	</analytic>
	<monogr>
		<title level="j">Journal of Computational and Applied Mathematics</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="53" to="65" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Part-of-speech tagging guidelines for the Penn Treebank Project</title>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
		<ptr target="ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz" />
		<imprint>
			<date type="published" when="1990">1990</date>
		</imprint>
		<respStmt>
			<orgName>Department of Computer and Information Science, University of Pennsylvania</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. Rep. MS-CIS-90-47</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Specht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<title level="m">Working Notes Papers of the CLEF 2017 Evaluation Labs</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint/>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
