<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Author Clustering with the Aid of a Simple Distance Measure: Notebook for PAN at CLEF 2017</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Houda</forename><surname>Alberts</surname></persName>
							<email>houda1996@hotmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Amsterdam</orgName>
								<address>
									<country key="NL">the Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Author Clustering with the Aid of a Simple Distance Measure: Notebook for PAN at CLEF 2017</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">055B9B1C648CFB582CA2F21B1B57E6B5</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>A simple distance measure is applied to the author clustering problem to determine which documents are written by the same author. The measure operates on the probability distribution of character sequences in a document, which makes it insensitive to language differences. Each distribution is built from the k most frequent features, with k = 300; punctuation is retained and uppercase letters are converted to lowercase. Character 2-grams are used because they gave the best results, and a threshold of 3.0 is applied to the symmetric distance score. Measured with the provided BCubed F-score, the approach achieves 0.54 on the training set and 0.53 on the test set, with a relatively low MAP score. Obtaining clusters from links remains problematic.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Authorship identification is an important aspect of stylometry that applies to many scenarios. For example, determining the author of a ransom note can save someone's life, and checking whether all assignments uploaded by a student are their own work can reduce plagiarism; it can also be applied in the arts to identify the author of an old text <ref type="bibr" target="#b4">[5]</ref>. While some of these examples concern authorship verification, finding out whether someone is the author of a text is part of the author clustering problem as well.</p><p>The author clustering problem <ref type="bibr" target="#b9">[10]</ref> is the task of partitioning a given set of documents in such a way that all documents in each partition are written by the same author, and clusters are maximal with respect to this property. The document sets vary in genre, language, and number of authors. The task has two parts: establishing links between documents (a link denotes that two documents are written by the same author) and clustering the documents.</p><p>Based on the two best performing systems from <ref type="bibr" target="#b2">[3]</ref> and <ref type="bibr" target="#b5">[6]</ref>, it can be concluded that both a recurrent neural network and a simple clustering algorithm can obtain good results. Since the focus of this paper is language processing and feature extraction rather than machine learning, the simple distance measure from <ref type="bibr" target="#b5">[6]</ref> is adapted and explored further instead of the recurrent neural network from <ref type="bibr" target="#b2">[3]</ref>. This paper starts with a short summary of the training data and the evaluation measures. Afterwards, the applied distance measure, Spatium-L1 <ref type="bibr" target="#b5">[6]</ref>, is discussed in depth together with the required text processing. Lastly, the results are discussed, concluding with alternative options for future research to obtain better performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Author Clustering Task at PAN-2017</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Training data</head><p>First, the training data is explored. This data is provided by PAN <ref type="bibr" target="#b8">[9]</ref> for applying text forensics. The data consists of problem sets in three languages (Dutch, English and Greek), each represented in two genres, reviews and articles. For each of these six combinations there are 10 problem sets of different sizes. The problem sets also differ in the number of clusters, the number of authors who contributed only a single document, and the number of (unique) words. The Greek review documents are the shortest on average (8021/200 ≈ 40.1 words), while the Dutch review documents are much longer (25583/182 ≈ 140.6 words), so a measure that focuses only on short texts, for example, would not suit every part of the data. These statistics are listed in Table <ref type="table" target="#tab_0">1</ref>. The "Clusters size 1" column gives the number of authors who contributed only one document within a problem, i.e., clusters of size 1.</p><p>Once a program is built, it can be evaluated with the TIRA evaluation software <ref type="bibr" target="#b7">[8]</ref>, which runs the code on a virtual machine. For the program to work on TIRA, it must output its results in a specific format. The clusters found are written as a nested list, in which each inner list holds the documents of one cluster, to a clustering.json file in the folder belonging to that problem. The links are written, together with their scores and in decreasing order of score (i.e., highest first), to a ranking.json file.</p></div>
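<div xmlns="http://www.tei-c.org/ns/1.0"><p>The output step described above can be sketched in Python as follows. This is a minimal sketch and not the submitted software: the helper names build_clustering, build_ranking and write_outputs are hypothetical, and the JSON key names "document", "document1", "document2" and "score" are assumptions that should be checked against the task specification, since the text above only fixes the file names, the nesting and the ordering.</p><p><code lang="python"><![CDATA[
```python
import json
from pathlib import Path


def build_clustering(clusters):
    """Turn a list of clusters (lists of file names) into a nested list.

    The "document" key is an assumed field name, not taken from the paper.
    """
    return [[{"document": doc} for doc in cluster] for cluster in clusters]


def build_ranking(scored_links):
    """Order (doc_a, doc_b, score) triples by decreasing score.

    The "document1", "document2" and "score" keys are assumed field names.
    """
    ordered = sorted(scored_links, key=lambda link: link[2], reverse=True)
    return [{"document1": a, "document2": b, "score": s} for a, b, s in ordered]


def write_outputs(problem_folder, clusters, scored_links):
    """Write clustering.json and ranking.json into the folder of one problem."""
    folder = Path(problem_folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "clustering.json").write_text(
        json.dumps(build_clustering(clusters), indent=2), encoding="utf-8")
    (folder / "ranking.json").write_text(
        json.dumps(build_ranking(scored_links), indent=2), encoding="utf-8")
```
]]></code></p><p>Sorting inside build_ranking keeps the "highest first" requirement in one place, so the caller can pass the links in any order.</p></div>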
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Evaluation</head><p>For this year's evaluation, two measures are used. The first one is the BCubed F-score from <ref type="bibr" target="#b1">[2]</ref>, based on the regular F-score in information retrieval. BCubed precision and recall are computed for each document and then averaged over all documents, so the measure reflects both the complete outcome and the separate clusters. The F-score is the harmonic mean of BCubed precision and BCubed recall. This evaluation measure, however, only judges the clusters that are obtained and not the links made.</p><p>To assess the quality of the created links, the mean average precision (MAP) score from <ref type="bibr" target="#b6">[7]</ref> is determined. It computes the average precision for each query, a problem set in this case, and takes the mean over all queries. The importance of this evaluation method lies in the ordering of the outcome. The links that the program returns are ordered by score, with the strongest links at the top and the weaker ones at the bottom. MAP uses this ordering: a correct link contributes more to the precision when it appears higher in the ranking.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Our Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Text Processing</head><p>The first step in obtaining the character N-gram features is text processing. The text is lowercased to reduce the number of possible features that can be extracted. All other symbols, including punctuation, are left in the text, as are the more frequent terms. Lemmatization, stemming, or any other method that maps words to a more general form is not applied.</p><p>It is mainly the more frequent terms within a text, the function words, that reveal the writing style of a person <ref type="bibr" target="#b4">[5]</ref>. With this in mind, frequent terms should not be removed or under-weighted. Features are then extracted as character N-grams. These character N-grams have length N and include punctuation and spaces. For example, the sentence "Hello, have you seen Alice?" would yield, among others, "ello,", "o, ha" and "Alice" as character 5-grams. The submitted software uses character 2-grams, since those features yielded the best result of all character N-grams with N ranging from 1 to 6. Character N-grams that occur only once may remain among these features; they can be disregarded by keeping only the top k most frequent character N-grams in the text. Kocher <ref type="bibr" target="#b5">[6]</ref> proposes, with evidence from Burrows and Savoy, that values of k between 200 and 300 yield the best outcomes, so these are considered as candidate values. The resulting frequencies are then converted into probabilities by taking the relative frequency, as shown in <ref type="formula" target="#formula_0">(1)</ref>, where P(t) is the probability of feature t and T is the set of all features. Due to the normalization over all features, the elements of the probability distribution sum to 1.</p><formula xml:id="formula_0">P(t) = \frac{\mathrm{frequency}_t}{\sum_{t' \in T} \mathrm{frequency}_{t'}}<label>(1)</label></formula></div>
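<div xmlns="http://www.tei-c.org/ns/1.0"><p>The text processing and equation (1) can be illustrated with a short Python sketch. It is not the submitted software: the function names char_ngrams and top_k_distribution are hypothetical, and normalizing over the k retained features (rather than over every N-gram in the document) is an assumption about how the set T is chosen.</p><p><code lang="python"><![CDATA[
```python
from collections import Counter


def char_ngrams(text, n=2):
    """Lowercase the text and return all overlapping character n-grams.

    Punctuation and spaces are kept, as described in the text.
    """
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def top_k_distribution(text, n=2, k=300):
    """Relative frequencies of the k most frequent character n-grams.

    Assumption: the normalization runs over the k retained features only,
    so the returned probabilities sum to 1.
    """
    counts = Counter(char_ngrams(text, n)).most_common(k)
    total = sum(count for _, count in counts)
    return {gram: count / total for gram, count in counts}
```
]]></code></p><p>Because the text is lowercased first, the 5-grams of the example sentence contain "alice" rather than "Alice".</p></div>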
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Simple Distance Measure</head><p>When the documents have been converted to probability distributions as described, the simple distance measure from <ref type="bibr" target="#b5">[6]</ref>, Spatium-L1, can be applied. It takes the element-wise absolute differences of the two vectors and sums them, as shown in <ref type="formula" target="#formula_1">(2)</ref>, where the summation range is defined by the k most frequent features of the first probability distribution.</p><formula xml:id="formula_1">\Delta(P_{doc1}, P_{doc2}) = \Delta_{12} = \sum_{i=1}^{n} |P_{doc1}(i) - P_{doc2}(i)|<label>(2)</label></formula><p>After computing these sums, the distances are converted to standard deviation scores, where a high score provides more evidence that the two documents are written by the same author; the dissimilarity measure thus becomes a similarity measure. To do so, the average distance of each document is calculated first, i.e., the mean distance from that document to all other documents within the problem set. The standard deviation for this document is then determined from the deviations of these distances from that average. The score of a pair is the number of standard deviations by which its distance lies below the document's average. For example, if ∆₁₂ = 40, the average distance from document 1 to all other documents is 50, and the standard deviation for this document is 5, the score is (50 − 40)/5 = 2, which becomes the standard deviation score for document 1 and document 2.</p><p>Since these scores are based on the features from the probability distribution of the first document (i.e., its top k most frequent features), the value of ∆(doc1, doc2) is in general not the same as that of ∆(doc2, doc1) (i.e., ∆₁₂ ≠ ∆₂₁), which makes both the distances and the standard deviation scores non-symmetric. Using only one score, i.e., one direction, is therefore not sufficient to judge a link. To keep the technique simple, the symmetric standard deviation score is computed as the sum of the two non-symmetric scores of the two documents. For instance, if score₁₂ = 3 and score₂₁ = 2, the symmetric score is 3 + 2 = 5. Lastly, this symmetric score is compared against a preset threshold: if the score is higher than the threshold, the two documents are considered to be written by the same author; otherwise they are not. Afterwards, the symmetric scores are scaled to a value between 0 and 1 using the maximum and minimum obtained scores, as shown in <ref type="formula" target="#formula_2">(3)</ref>.</p><formula xml:id="formula_2">\mathrm{score}_i = \frac{\mathrm{distance}_i - \min(\mathrm{distance})}{\max(\mathrm{distance}) - \min(\mathrm{distance})}<label>(3)</label></formula></div>
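<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring steps of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the submitted implementation: all function names are hypothetical, a feature missing from the second document is assumed to have probability 0, the population standard deviation is assumed (the text does not say which variant is used), and a document whose distances are all equal is given a score of 0.</p><p><code lang="python"><![CDATA[
```python
import statistics


def spatium_l1(p_first, p_second):
    """Equation (2): sum of absolute differences over the features of p_first.

    Assumption: a feature absent from p_second counts as probability 0.
    """
    return sum(abs(p - p_second.get(gram, 0.0)) for gram, p in p_first.items())


def deviation_scores(distances):
    """Map directed pairs (a, b) -> distance to standard deviation scores.

    The score is (mean - distance) / sd, computed per first document a,
    so a distance below the average yields a positive score.
    """
    rows = {}
    for (a, _), dist in distances.items():
        rows.setdefault(a, []).append(dist)
    scores = {}
    for (a, b), dist in distances.items():
        mean = statistics.mean(rows[a])
        sd = statistics.pstdev(rows[a])  # population sd: an assumption
        scores[(a, b)] = (mean - dist) / sd if sd > 0 else 0.0
    return scores


def symmetric_scores(scores):
    """Add the two directed scores of every unordered document pair."""
    symmetric = {}
    for (a, b), value in scores.items():
        pair = frozenset((a, b))
        symmetric[pair] = symmetric.get(pair, 0.0) + value
    return symmetric


def min_max_scale(values):
    """Equation (3): rescale a dict of scores to the range 0..1."""
    low, high = min(values.values()), max(values.values())
    if high == low:
        return {key: 0.0 for key in values}
    return {key: (v - low) / (high - low) for key, v in values.items()}
```
]]></code></p><p>Calling spatium_l1 once in each direction shows the asymmetry discussed above, since each call sums only over the features of its first argument.</p></div>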
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Clustering</head><p>When Spatium-L1 has been applied to the data, zero or more links are found. These links are then used to cluster the documents, i.e., to find which sets of documents are written by the same author. This is done with connected components. All documents are added as nodes to a graph, and all links found by the measure are added as edges. The connected components procedure then returns clusters in which every document is connected to every other document through a path of one or more links. For instance, if documents 1 and 2, documents 2 and 3, and documents 3 and 4 are linked, they form one cluster containing documents 1, 2, 3 and 4.</p><p>Another example can be found in Figure <ref type="figure" target="#fig_0">1</ref>, where the upper-right document is connected to the upper-left document through the documents in between, which still counts as a connection and yields the purple cluster. The downside is that every link is considered equally strong, while the connection between the upper-right document and the middle document may not be strong enough to justify including it in the cluster.</p><p>When running the simple distance measure on the training data, the highest BCubed F-score, 0.54, is achieved with character 2-grams, k equal to 300 and a threshold of 3.0. Different values of N (1 to 6) were explored for the character N-grams, but N = 2 gave the best results.</p><p>The results for the other character N-grams are therefore disregarded. The BCubed F-score and the MAP value can be found in Figures <ref type="figure" target="#fig_1">2a and 2b</ref> respectively, for different values of k and for thresholds between 2.0 and 6.0. The best BCubed F-score is found when k = 300 and the threshold is 3.0, as mentioned before: not too high but not too low either. The MAP score, however, is maximized at a different threshold, namely 2.0. A lower threshold is more lenient in accepting links and therefore returns more links, which improves the performance on this measure. Although this benefits the MAP score, it does not benefit the F-score, which needs a higher threshold; this is why the software was submitted with the parameters achieving the best BCubed F-score.</p><p>On the test data, the performance differs greatly between the participants, as shown in Table <ref type="table" target="#tab_1">2</ref>. Whereas my software achieves a good BCubed F-score (0.53), its MAP score is quite low (0.04), leaving room for improvement compared to the participant with the best result (Gómez-Adorno et al., with 0.57 and 0.46). The BCubed F-scores are, however, considerably lower than last year, which may be due to the differences between the datasets. Last year's dataset contained larger problem sets and much longer texts (three paragraphs, for example), so more evidence about someone's writing style could be found in the text; this year the problem sets were smaller and the documents shorter. A robust measure or algorithm is therefore needed that works with both kinds of datasets.</p></div>
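<div xmlns="http://www.tei-c.org/ns/1.0"><p>The step from links to clusters, and the BCubed F-score used to judge the clusters, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the submitted software: the function names cluster_links and bcubed_f are hypothetical, the graph search is written out by hand instead of using a graph library, and the BCubed computation follows the general per-document definition, which may differ in detail from the official evaluation script.</p><p><code lang="python"><![CDATA[
```python
def cluster_links(documents, scored_links, threshold=3.0):
    """Keep links scoring above the threshold and return connected components.

    scored_links holds (doc_a, doc_b, symmetric_score) triples.
    Documents without any accepted link end up in clusters of size 1.
    """
    neighbours = {doc: set() for doc in documents}
    for a, b, score in scored_links:
        if score > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen = set()
    clusters = []
    for doc in documents:
        if doc in seen:
            continue
        seen.add(doc)
        stack, component = [doc], []
        while stack:  # depth-first search over accepted links
            current = stack.pop()
            component.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        clusters.append(sorted(component))
    return clusters


def bcubed_f(predicted, truth):
    """BCubed F-score of two clusterings given as lists of document lists.

    Precision and recall are computed per document and then averaged.
    Both clusterings must cover the same documents.
    """
    pred_of = {doc: i for i, cluster in enumerate(predicted) for doc in cluster}
    true_of = {doc: i for i, cluster in enumerate(truth) for doc in cluster}
    precision = recall = 0.0
    for doc in true_of:
        same_pred = set(predicted[pred_of[doc]])
        same_true = set(truth[true_of[doc]])
        overlap = len(same_pred & same_true)
        precision += overlap / len(same_pred)
        recall += overlap / len(same_true)
    precision /= len(true_of)
    recall /= len(true_of)
    return 2 * precision * recall / (precision + recall)
```
]]></code></p><p>In this sketch a single weak link that passes the threshold is enough to merge two groups, which is exactly the downside of connected components discussed above.</p></div>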
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion &amp; Discussion</head><p>To conclude, using the top 300 most frequent character 2-grams of each document, a BCubed F-score of 0.54 is achieved on the provided training data with a threshold of 3.0, whereas the same parameters obtain a BCubed F-score of 0.53 on the test set. While this method does show some performance, the result is still insufficient. Based on the current technique, there are still opportunities for further enhancements. For example, only character N-grams with N between 1 and 6 were tested. Higher values of N, or word N-grams, may improve the outcome. Another improvement lies in the text processing, which currently only consists of converting all the text to lowercase. Keeping uppercase characters, replacing unique words with a specific symbol, or removing punctuation may all improve the results. Next, the parameters could be set separately for each language and genre. Linguistic features of a language can influence the recognition of writing styles, so separate parameters could improve the outcome. However, this could not be tested for this submission due to time constraints.</p><p>Lastly, deriving the clusters from the found links is not optimal. By applying the connected components method blindly, weak links may wrongly merge two separate clusters. An alternative is to use weighted links when extracting the clusters, for example with the Louvain modularity method from <ref type="bibr" target="#b3">[4]</ref>. In this way, weak links between strong clusters can be discarded to obtain better performance.</p><p>Since this task is also the topic of my bachelor thesis, I continued the research after the submission and tried some of the previously mentioned alternatives to improve the results. Another, more sophisticated distance measure, the Jensen-Shannon divergence, turned out to work much better, obtaining BCubed F-scores around 0.53 and MAP scores exceeding 0.3. Word N-grams are not an improvement, and although the Louvain modularity does perform more sophisticated clustering, it requires the found links to be more precise. The text processing alternatives do improve the quality of the links, meaning a higher MAP score is achieved. These results also indicate that more sophisticated distance measures are worth looking into.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 .</head><label>1</label><figDesc>Figure 1. The connected components clustering option; all connected documents make a cluster indicated with a single color</figDesc><graphic coords="5,240.41,237.13,132.30,156.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 .</head><label>2</label><figDesc>Figure 2. BCubed F-Scores and MAP values for different k values per threshold for character 2-grams on the training set</figDesc><graphic coords="6,169.18,304.61,274.75,162.86" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics of the 2017 training data in absolute amounts</figDesc><table><row><cell></cell><cell cols="6">Total</cell></row><row><cell>Corpus</cell><cell>Problem Sets</cell><cell>Texts</cell><cell>Clusters size 1</cell><cell>Clusters</cell><cell>Words</cell><cell>Unique Words</cell></row><row><cell>Dutch Reviews</cell><cell>10</cell><cell>182</cell><cell>3</cell><cell>65</cell><cell>25583</cell><cell>4607</cell></row><row><cell>Dutch Articles</cell><cell>10</cell><cell>200</cell><cell>20</cell><cell>53</cell><cell>10543</cell><cell>3305</cell></row><row><cell>English Reviews</cell><cell>10</cell><cell>194</cell><cell>19</cell><cell>61</cell><cell>12073</cell><cell>3547</cell></row><row><cell>English Articles</cell><cell>10</cell><cell>200</cell><cell>18</cell><cell>56</cell><cell>10529</cell><cell>3425</cell></row><row><cell>Greek Reviews</cell><cell>10</cell><cell>200</cell><cell>21</cell><cell>61</cell><cell>8021</cell><cell>3187</cell></row><row><cell>Greek Articles</cell><cell>10</cell><cell>200</cell><cell>15</cell><cell>60</cell><cell>9824</cell><cell>2840</cell></row><row><cell>Total</cell><cell>60</cell><cell>1176</cell><cell>96</cell><cell>356</cell><cell>76573</cell><cell>20911</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Outcomes of the PAN 2017 competition on the test set</figDesc><table><row><cell>Participant</cell><cell>BCubed F-score</cell><cell>MAP</cell><cell>Runtime</cell></row><row><cell>Alberts</cell><cell>0.53</cell><cell>0.04</cell><cell>00:01:45</cell></row><row><cell>Gómez-Adorno et al.</cell><cell>0.57</cell><cell>0.46</cell><cell>00:02:05</cell></row><row><cell>García et al.</cell><cell>0.56</cell><cell>0.38</cell><cell>00:15:49</cell></row><row><cell>Halvani &amp; Graner</cell><cell>0.55</cell><cell>0.14</cell><cell>00:12:25</cell></row><row><cell>Karaś</cell><cell>0.47</cell><cell>0.13</cell><cell>00:00:26</cell></row><row><cell>Kocher &amp; Savoy</cell><cell>0.55</cell><cell>0.40</cell><cell>00:00:41</cell></row></table></figure>
		</body>
		<back>

			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work was done as part of my bachelor thesis in Artificial Intelligence <ref type="bibr" target="#b0">[1]</ref> under the supervision of Maarten Marx.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Authorship Identification by Author Clustering</title>
		<author>
			<persName><forename type="first">H</forename><surname>Alberts</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">Unpublished bachelor thesis</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A comparison of extrinsic clustering evaluation metrics based on formal constraints</title>
		<author>
			<persName><forename type="first">E</forename><surname>Amigó</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Artiles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Verdejo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="461" to="486" />
			<date type="published" when="2009-08">Aug 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Authorship clustering using multi-headed recurrent neural networks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Bagnall</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="791" to="804" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Fast unfolding of communities in large networks</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">D</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Guillaume</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lambiotte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lefebvre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Statistical Mechanics: Theory and Experiment</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">10008</biblScope>
			<date type="published" when="2008-10">Oct 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stronks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Bruin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>De Winkel</surname></persName>
		</author>
		<title level="m">Van wie is het Wilhelmus</title>
				<imprint>
			<publisher>Amsterdam University Press</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">UniNE at CLEF 2016: Author Clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kocher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2016 -Conference and Labs of the Evaluation forum</title>
				<imprint>
			<publisher>Citeseer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="895" to="902" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<title level="m">Introduction to information retrieval</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Overview of PAN&apos;17: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Initiative (CLEF 17)</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Jones</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Lawless</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Gonzalo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Kelly</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Verhoeven</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Specht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<ptr target=".org" />
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2017 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Goeuriot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Mandl</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2017-09">Sep 2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
