<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">CIC-GIL Approach to Cross-domain Authorship Attribution Notebook for PAN at CLEF 2018</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Carolina</forename><surname>Martín-Del-Campo-Rodríguez</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Helena</forename><surname>Gómez-Adorno</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Engineering Institute (II)</orgName>
								<orgName type="institution">Universidad Nacional Autónoma de México (UNAM)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Grigori</forename><surname>Sidorov</surname></persName>
							<email>sidorov@cic.ipn.mx</email>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ildar</forename><surname>Batyrshin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Center for Computing Research (CIC)</orgName>
								<orgName type="institution">Instituto Politécnico Nacional (IPN)</orgName>
								<address>
									<settlement>Mexico City</settlement>
									<country key="MX">Mexico</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">CIC-GIL Approach to Cross-domain Authorship Attribution Notebook for PAN at CLEF 2018</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F7FD5C5923254B80BD6EE4114123B784</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present the CIC-GIL approach to the cross-domain authorship attribution task at PAN 2018. This year's evaluation lab focuses on the closed-set attribution task applied to a Fanfiction corpus in five languages: English, French, Italian, Polish, and Spanish. We followed a traditional machine learning approach and selected different feature sets depending on the language. We evaluated document features such as typed and untyped character n-grams, word n-grams, and function word n-grams. Our final system uses the log-entropy weighting scheme and SVM as classifier.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The authorship attribution (AA) task consists of identifying the author of a given document from a list of candidates. There are several subtasks within the authorship attribution field, such as author identification <ref type="bibr" target="#b3">[4]</ref>, author obfuscation <ref type="bibr" target="#b10">[11]</ref>, and author profiling <ref type="bibr" target="#b11">[12]</ref>. AA methods are used in many practical applications, such as electronic commerce, forensics, and humanities research <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b4">5]</ref>. The AA task is viewed as a multi-class, single-label classification problem, i.e., an automatic method has to assign a single class label (the author) to each document of unknown authorship.</p><p>Character n-grams are considered among the best feature representations for authorship attribution problems <ref type="bibr" target="#b15">[16]</ref>. In <ref type="bibr" target="#b13">[14]</ref>, the authors introduced a categorization of character n-grams and showed that some categories perform better than others in an AA task. Furthermore, several studies indicate that combining different types of n-grams gives the classification algorithm useful information and yields a more robust model <ref type="bibr" target="#b12">[13]</ref>.</p><p>This paper describes our approach to the cross-domain authorship attribution task at PAN 2018 <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b16">17]</ref>. We examined different document features (typed and untyped character n-grams, word n-grams, and function word n-grams), weighting schemes (tf-idf and log-entropy), and machine learning algorithms (support vector machines, multinomial naive Bayes, and multi-layer perceptron).</p><p>The corpus of the authorship attribution shared task at PAN 2018 is focused on cross-domain attribution. 
It is more challenging than the classical AA setting (single-topic AA), because the training and testing documents can belong to different domains (e.g., thematic area or genre). The documents in the corpus are fanfics, i.e., fictional literature based on the theme, atmosphere, style, characters, story world, etc. of a certain known author.</p><p>The development phase corpus (DPC), like the test phase corpus (TPC), is composed of a training corpus and a test corpus. Although the candidate authors of the DPC and the TPC have similar characteristics, the two sets of candidate authors do not overlap.</p><p>The development phase corpus is composed of 10 problems in five languages (two problems per language): English, French, Italian, Polish, and Spanish. The specifications of the problems are defined in <ref type="bibr" target="#b3">[4]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>In this section, we first cover the concept of typed character n-grams, then the log-entropy weighting scheme, and finally the experimental settings of the methodology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Typed character n-grams</head><p>Typed character n-grams, introduced by <ref type="bibr" target="#b13">[14]</ref>, are subgroups of character n-grams that correspond to three distinct linguistic aspects: morphosyntax (represented by affix n-grams), thematic content (represented by word n-grams), and style (represented by punctuation n-grams). These subgroups are called super categories (SC). Each SC is divided into several categories:</p><p>- Affix n-grams: capture morphology to some extent (prefix, suffix, space-prefix, space-suffix). - Word n-grams: capture partial words and other word-relevant tokens (whole-word, mid-word, multi-word). - Punctuation n-grams: capture patterns of punctuation (beg-punct, mid-punct, end-punct).</p><p>Some categories of character n-grams showed higher predictive capability in the AA task <ref type="bibr" target="#b13">[14]</ref> than the full set of n-grams (categorized and uncategorized). The redefinition of these categories proposed in <ref type="bibr" target="#b6">[7]</ref> unambiguously assigns each 3-gram to exactly one category and does not exclude any n-gram (as happened with consecutive punctuation marks in the original proposal). The authors also showed that some categories perform better than others for AA.</p></div>
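<div xmlns="http://www.tei-c.org/ns/1.0"><p>The categories above can be illustrated with a minimal Python sketch. It is a simplified reading of the category descriptions, not the implementation used for the submission and not the exact rules of <ref type="bibr" target="#b13">[14]</ref> or <ref type="bibr" target="#b6">[7]</ref>. The function names, the priority order of the checks (punctuation first, then spaces, then position inside the word), and the treatment of any whitespace or punctuation character as a word boundary are all assumptions made for the example.</p><p><![CDATA[
```python
import string

PUNCT = set(string.punctuation)


def is_boundary(ch):
    """Assumption: whitespace and punctuation both delimit words."""
    return ch.isspace() or ch in PUNCT


def categorize(text, i, n):
    """Assign one category to the character n-gram text[i:i + n]."""
    gram = text[i:i + n]
    first, middle, last = gram[0], gram[1:-1], gram[-1]
    # Punctuation n-grams take priority in this sketch.
    if any(c in PUNCT for c in middle):
        return "mid-punct"
    if first in PUNCT:
        return "beg-punct"
    if last in PUNCT:
        return "end-punct"
    # N-grams that touch a space.
    if any(c.isspace() for c in middle):
        return "multi-word"
    if first.isspace():
        return "space-prefix"
    if last.isspace():
        return "space-suffix"
    # The n-gram lies inside one word: inspect its neighbours.
    starts_word = i == 0 or is_boundary(text[i - 1])
    ends_word = i + n == len(text) or is_boundary(text[i + n])
    if starts_word and ends_word:
        return "whole-word"
    if starts_word:
        return "prefix"
    if ends_word:
        return "suffix"
    return "mid-word"


def typed_char_ngrams(text, n=3):
    """Return (category, n-gram) pairs for every character n-gram."""
    return [(categorize(text, i, n), text[i:i + n])
            for i in range(len(text) - n + 1)]
```
]]></p><p>For the word "authors" and n = 3, this sketch labels "aut" as prefix, "uth", "tho", and "hor" as mid-word, and "ors" as suffix. Restricting the feature set to selected categories then amounts to keeping only the pairs whose category is in the chosen list.</p></div>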
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Log-entropy</head><p>Global weighting functions measure the importance of a term across the entire collection of documents <ref type="bibr" target="#b2">[3]</ref>. Previous research on document similarity judgments <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b8">9]</ref> has shown that entropy-based global weighting is generally better than the tf-idf model. The log-entropy (le) weight is calculated with the following equations (Equations <ref type="formula" target="#formula_0">1</ref> and <ref type="formula" target="#formula_1">2</ref>):</p><formula xml:id="formula_0">le_{ij} = e_i \times \log(tf_{ij} + 1),<label>(1)</label></formula><formula xml:id="formula_1">e_i = 1 + \sum_j \frac{p_{ij} \log p_{ij}}{\log n}, \quad \text{where } p_{ij} = \frac{tf_{ij}}{gf_i},<label>(2)</label></formula><p>where n is the number of documents, tf_{ij} is the frequency of term i in document j, and gf_i is the frequency of term i in the whole collection. A term that appears once in every document has a global weight e_i of zero. A term that appears once in a single document has a global weight of one. Any other combination of frequencies gives the term a global weight between zero and one.</p></div>
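<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two equations above can be checked with a short Python sketch that uses only the standard library. It is an illustration under stated assumptions rather than the code of the submitted system: the function name log_entropy is hypothetical, documents are assumed to be already split into terms, and the natural logarithm is assumed (the base cancels in the global weight but scales the local term log(tf + 1)).</p><p><![CDATA[
```python
import math
from collections import Counter


def log_entropy(documents):
    """Return one {term: weight} dict per document.

    documents: list of term lists; at least two documents are
    required because the global weight divides by log(n).
    """
    n = len(documents)
    if n < 2:
        raise ValueError("log-entropy needs at least two documents")
    tf = [Counter(doc) for doc in documents]   # tf[j][i]: term i in doc j
    gf = Counter()                             # gf[i]: term i in collection
    for counts in tf:
        gf.update(counts)

    # Global weight e_i = 1 + sum_j p_ij * log(p_ij) / log(n)
    e = {}
    for term, total in gf.items():
        acc = 0.0
        for counts in tf:
            if term in counts:
                p = counts[term] / total
                acc += p * math.log(p)
        e[term] = 1.0 + acc / math.log(n)

    # Final weight le_ij = e_i * log(tf_ij + 1)
    return [{t: e[t] * math.log(f + 1) for t, f in counts.items()}
            for counts in tf]
```
]]></p><p>With the three toy documents [a, b], [a, c, c], and [a], the term a occurs once in every document and receives a weight of zero (up to rounding), while b occurs once in a single document, so its global weight is one and its final weight is log 2.</p></div>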
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Experimental Settings</head><p>After evaluating several classification algorithms, we chose a Support Vector Machine (SVM) for our final approach, since this algorithm is recommended when the number of dimensions is greater than the number of samples (as in this case) <ref type="bibr" target="#b7">[8]</ref>. We used the SVM implementation of sklearn <ref type="bibr" target="#b0">[1]</ref> with the one-against-all strategy and the default parameter settings. We analyzed several text representation schemes: typed character n-grams (with n varying from 2 to 8), untyped character n-grams (with n equal to 3 or 4), word n-grams (with n varying from 1 to 5), and the function word n-grams proposed by Stamatatos <ref type="bibr" target="#b14">[15]</ref>.</p><p>We implemented the character n-gram types introduced by Sapkota et al. <ref type="bibr" target="#b13">[14]</ref>, but with the redefinitions of Markov et al. <ref type="bibr" target="#b6">[7]</ref>, which make them more accurate and complete.</p><p>For function word n-grams, we used the 50 most frequent stop-words, as described in <ref type="bibr" target="#b14">[15]</ref>, to form the n-grams (with n equal to 8). For English, we used the 50 most frequent stop-words listed in <ref type="bibr" target="#b14">[15]</ref>. For the other languages (French, Italian, Polish, and Spanish), the 50 most frequent stop-words were extracted from the training part of the development corpus.</p><p>We evaluated different combinations of features for the different languages in the corpus. We also performed an evaluation study in order to identify the most useful typed character n-gram categories for each language. Table <ref type="table" target="#tab_0">1</ref> shows the combination of features as well as the types of character n-grams used in our final submission.</p><p>Moreover, we experimented with different feature document frequency thresholds. 
We considered thresholds between 1 and 3, i.e., features that occur in at least 1, 2, or 3 documents in each problem. We found that the features that occur in at least 2 documents achieved the best classification performance in our experiments. Following the experimental settings presented in <ref type="bibr" target="#b2">[3]</ref>, we examined two feature representations based on a global weighting scheme: log-entropy and tf-idf. Global weighting functions measure the importance of a word across the entire collection of documents. Previous research on document similarity judgments <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b8">9]</ref> and authorship attribution <ref type="bibr" target="#b2">[3]</ref> has shown that entropy-based global weighting is generally better than the tf-idf model. We use log-entropy as the weighting function for our final version.</p></div>
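<div xmlns="http://www.tei-c.org/ns/1.0"><p>The feature extraction steps described in this section can be sketched in a few lines of standard-library Python. This is an illustration, not the submitted system: the function names are hypothetical, tokenization is assumed to have been done beforehand, and building function word n-grams by first discarding every token that is not a function word is an assumed reading of <ref type="bibr" target="#b14">[15]</ref>.</p><p><![CDATA[
```python
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def function_word_ngrams(tokens, function_words, n):
    """Keep only the function words, in text order, then build n-grams."""
    kept = [t for t in tokens if t in function_words]
    return ngrams(kept, n)


def filter_by_document_frequency(doc_features, min_df=2):
    """Drop features that occur in fewer than min_df documents."""
    df = Counter()
    for features in doc_features:
        df.update(set(features))  # count each feature once per document
    keep = {f for f, count in df.items() if count >= min_df}
    return [[f for f in features if f in keep] for features in doc_features]
```
]]></p><p>Calling filter_by_document_frequency with min_df=2 corresponds to the threshold that worked best in the experiments above: a feature survives only if it occurs in at least two documents of the problem, and its within-document repetitions are left untouched for the later weighting step.</p></div>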
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluation Measure and Results</head><p>The macro-averaged F1 score is used for evaluating the performance of the systems participating in the authorship attribution shared task at PAN CLEF 2018 <ref type="bibr" target="#b3">[4]</ref>.</p><p>The final configuration of our approach was selected based on the classification performance on the test set of the development phase corpus (DPC). Table <ref type="table" target="#tab_1">2</ref> shows the results obtained on the DPC with the above-specified configuration evaluated on the TIRA platform <ref type="bibr" target="#b9">[10]</ref>. The results achieved on the test phase corpus (TPC) are shown in Table <ref type="table" target="#tab_2">3</ref>. It can be observed that the performance on the TPC is much lower than on the DPC. This behavior can be explained by our decision to tune the system on the classification performance over the test set of the DPC, which likely fitted the configuration too closely to those problems.</p></div>
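<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the measure, the macro-averaged F1 score can be computed as in the following Python sketch. It is not the official evaluation script: the function name macro_f1 is hypothetical, and averaging over every author that appears in either the true or the predicted labels is an assumption that the official evaluator may not share.</p><p><![CDATA[
```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: the unweighted mean of the per-author F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("need two non-empty label lists of equal length")
    # Assumption: average over every author seen in either list.
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2TP / (2TP + FP + FN); each label occurs at least once,
        # so the denominator is never zero.
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)
```
]]></p><p>Because every author contributes equally to the mean, an author with few test documents weighs as much as a prolific one, so a system cannot score well by favoring the most frequent candidates.</p></div>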
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>We presented the system that was submitted to the Cross-domain Authorship Attribution task at PAN 2018. We experimented with different features and found that selecting a specific set of features per language gave the best performance. Our approach performed well on the development phase corpus (Macro-Average F1: 0.747), but its performance dropped considerably on the test phase corpus (Macro-Average F1: 0.588). The current technique therefore leaves room for further improvement.</p><p>In future research, we would like to consider a cross-validation approach for the development phase corpus to make the system more robust.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1.</head><label>1</label><figDesc>Features included for each language in our final submission.</figDesc><table><row><cell>Language</cell><cell>Features</cell><cell>Typed character n-gram categories</cell></row><row><cell>English</cell><cell>typed character n-grams (2, 3, 5)</cell><cell>whole-word, mid-word, multi-word, beg-punct, mid-punct, end-punct</cell></row><row><cell>French</cell><cell>typed character n-grams (2, 4, 5)</cell><cell>prefix, mid-word, multi-word, beg-punct, end-punct</cell></row><row><cell>Italian</cell><cell>word n-grams (1, 2, 3, 5)</cell><cell></cell></row><row><cell>Polish</cell><cell>word n-grams (2, 5)</cell><cell></cell></row><row><cell>Spanish</cell><cell>character n-grams (3), typed character n-grams (4) and word n-grams (1, 2)</cell><cell>beg-punct</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2.</head><label>2</label><figDesc>Results of the Cross Domain Authorship-Attribution Task on the Development Phase Corpus</figDesc><table><row><cell>Language</cell><cell>Problem</cell><cell>Macro-Average F1</cell></row><row><cell>English</cell><cell>problem 1</cell><cell>0.582</cell></row><row><cell></cell><cell>problem 2</cell><cell>0.783</cell></row><row><cell>French</cell><cell>problem 3</cell><cell>0.659</cell></row><row><cell></cell><cell>problem 4</cell><cell>0.938</cell></row><row><cell>Italian</cell><cell>problem 5</cell><cell>0.702</cell></row><row><cell></cell><cell>problem 6</cell><cell>0.637</cell></row><row><cell>Polish</cell><cell>problem 7</cell><cell>0.589</cell></row><row><cell></cell><cell>problem 8</cell><cell>0.893</cell></row><row><cell>Spanish</cell><cell>problem 9</cell><cell>0.804</cell></row><row><cell></cell><cell>problem 10</cell><cell>0.879</cell></row><row><cell cols="2">Overall score</cell><cell>0.747</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3.</head><label>3</label><figDesc>Results of the Cross Domain Authorship-Attribution Task on the Test Phase Corpus</figDesc><table><row><cell>User</cell><cell>Macro-Average F1</cell><cell>Runtime</cell></row><row><cell>custodio18</cell><cell>0.685</cell><cell>00:04:27</cell></row><row><cell>murauer18</cell><cell>0.643</cell><cell>00:19:15</cell></row><row><cell>halvani18</cell><cell>0.629</cell><cell>00:42:50</cell></row><row><cell>mosavat18</cell><cell>0.613</cell><cell>00:03:34</cell></row><row><cell>yigal18</cell><cell>0.598</cell><cell>00:24:09</cell></row><row><cell>delcamporodriguez18</cell><cell>0.588</cell><cell>00:11:01</cell></row><row><cell>pan18-baseline</cell><cell>0.584</cell><cell>00:01:18</cell></row><row><cell>miller18</cell><cell>0.582</cell><cell>00:30:58</cell></row><row><cell>schaetti18</cell><cell>0.387</cell><cell>01:17:57</cell></row><row><cell>gagala18</cell><cell>0.267</cell><cell>01:37:56</cell></row><row><cell>garciacumbreras18</cell><cell>0.139</cell><cell>00:38:46</cell></row><row><cell>tabealhoje18</cell><cell>0.028</cell><cell>02:19:14</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was partially supported by the Mexican Government (CONACYT projects 240844, SNI, COFAA-IPN, SIP-IPN 20181849, 20171813) and Honeywell Grant.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">API design for machine learning software: experiences from the scikit-learn project</title>
		<author>
			<persName><forename type="first">L</forename><surname>Buitinck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Louppe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mueller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Niculae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Grobler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Layton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Holt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECML PKDD Workshop: Languages for Data Mining and Machine Learning</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="108" to="122" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">On admissible linguistic evidence</title>
		<author>
			<persName><forename type="first">M</forename><surname>Coulthard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Law &amp; Policy</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="page">441</biblScope>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Author clustering using hierarchical clustering analysis</title>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Aleman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vilariño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Sanchez-Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pinto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF 2017 Working Notes. CEUR Workshop Proceedings</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Specht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2018 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS.org</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2018-09">Sep 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatically identifying pseudepigraphic texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Seidman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2013 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="1449" to="1454" />
		</imprint>
	</monogr>
	<note>EMNLP &apos;13</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">An empirical evaluation of models of text document similarity</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Navarro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Nikkerud</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Cognitive Science Society</title>
				<meeting>the Cognitive Science Society</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">27</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Improving cross-topic authorship attribution: The role of pre-processing</title>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing</title>
				<meeting>the 18th International Conference on Computational Linguistics and Intelligent Text Processing</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Comparison of human and latent semantic analysis (LSA) judgements of pairwise document similarities for a news corpus</title>
		<author>
			<persName><forename type="first">B</forename><surname>Pincombe</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
		<respStmt>
			<orgName>DEFENCE SCIENCE AND TECHNOLOGY ORGANISATION SALISBURY (AUSTRALIA) INFO SCIENCES LAB</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Tech. rep</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Improving the Reproducibility of PAN&apos;s Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Gollub</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14)</title>
				<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Lupu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Clough</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sanderson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Hall</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Hanbury</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Toms</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2014-09">Sep 2014</date>
			<biblScope unit="page" from="268" to="299" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of the Author Obfuscation Task at PAN 2018: A New Approach to Measuring Safety</title>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Schremmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2018 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS.org</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2018-09">Sep 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Overview of the 6th Author Profiling Task at PAN 2018: Multimodal Gender Identification in Twitter</title>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes Papers of the CLEF 2018 Evaluation Labs</title>
		<title level="s">CEUR Workshop Proceedings, CLEF and CEUR-WS.org</title>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2018-09">Sep 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Comparison of character n-grams and lexical features on author, gender, and language variety identification on the same Spanish news corpus</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Sanchez-Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Markov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gómez-Adorno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sidorov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="145" to="151" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Not all character n-grams are created equal: A study in authorship attribution</title>
		<author>
			<persName><forename type="first">U</forename><surname>Sapkota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-y-Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies</title>
		<meeting>the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="93" to="102" />
		</imprint>
	</monogr>
	<note>NAACL-HLT&apos;15, Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Plagiarism detection using stopword n-grams</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2512" to="2527" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A survey of modern authorship attribution methods</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="538" to="556" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Overview of PAN-2018: Author Identification, Author Profiling, and Author Obfuscation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Stamatatos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rangel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tschuggnall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kestemont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Potthast</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 9th International Conference of the CLEF Initiative (CLEF 18)</title>
		<editor>
			<persName><forename type="first">P</forename><surname>Bellot</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Trabelsi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Mothe</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Murtagh</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Sanjuan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg New York</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2018-09">Sep 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
