<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Optimizing Authorship Profiling of Online Messages</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Adeola</forename><forename type="middle">O</forename><surname>Opesade</surname></persName>
							<email>morecrown@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">Africa Regional Centre for Information Science</orgName>
								<orgName type="institution">University of Ibadan</orgName>
								<address>
									<country key="NG">Nigeria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Optimizing Authorship Profiling of Online Messages</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4B57FF4DF44D76BAC023F5D17FFABA4D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:54+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Authorship profiling</term>
					<term>Machine learning</term>
					<term>Computational linguistics</term>
					<term>Natural Language Processing</term>
					<term>Nigerian English</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Authorship profiling is of growing importance in the current information age, partly due to its application in digital forensics. Profiling methodologies, like those of any other authorship analysis, consist mainly of feature extraction and the application of analytical techniques. The choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis; hence the need for methods that can improve the success of authorship profiling undertakings. The present study sought, through experiments, the writing features, analytical technique and number of class labels that can improve the effectiveness of profiling the country of affiliation of authors of online messages. The experiments showed that the most effective model was achieved when all feature set types in the study were used within a two-class dataset analysed with the Neural Network (Multilayer Perceptron) machine learning scheme. The study recommends further work on finding models that can maximize both effectiveness and efficiency in profiling the authorship of online messages.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Electronic messages are extensively used to distribute information over such channels as e-mail, Internet newsgroups, Internet chat rooms, Internet forums and other user-generated content on the Web. These messages are quite different from other forms of writing, particularly because of their brevity. Unfortunately, unethical individuals and criminals exploit the convenience of these media to carry out their obnoxious goals. Digital forensics requires the use of scientifically derived and proven methods towards the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for litigation purposes.</p><p>Authorship profiling is one of the major classes of authorship attribution problems. It seeks the demographic or psychological group of the author of an anonymous text. Its application in forensics and digital security has made it increasingly important in the present information age. Profiling methodologies, like those of any other authorship analysis, consist mainly of feature extraction and the application of analytical techniques. The choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis <ref type="bibr" target="#b1">[1]</ref>; thus, studies into the optimization of authorship profiling of online messages can assist in improving the success of identifying sources of security threats perpetrated through web-based channels.</p><p>A number of previous studies (<ref type="bibr" target="#b1">[1]</ref>; <ref type="bibr">[22]</ref>; <ref type="bibr" target="#b4">[3]</ref>) have investigated some parameters that could affect the effectiveness of authorship attribution undertakings. These studies, however, focused on the authorship identification problem and not on authorship profiling.
Considering the potential of authorship profiling in investigating transnational digital breaches, the present study seeks to find, through experiments, the writing-style features, classification techniques and number of class options that can maximize the effectiveness of profiling the authorship of electronic messages. The following research questions were pursued in order to achieve the purpose of the study:</p><p>Research Question 1: Which feature type set maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p><p>Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p><p>Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p><p>Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">LITERATURE REVIEW</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Authorship Attribution Problems</head><p>Authorship attribution is a process of examining the characteristics of a piece of writing in order to draw conclusions about its author. Authorship attribution problems vary in complexity. They have been categorized into three major classes, namely, authorship identification, authorship profiling and authorship verification. The most straightforward version of these three is the identification problem, which involves the determination of the actual author of a given text among a small set of candidate authors. Given a set of writings of a number of authors, the task in authorship identification is to assign a new piece of writing to one of them <ref type="bibr" target="#b5">[4]</ref>. In authorship verification, there is no closed candidate set but there is one suspect, and the challenge is to determine if the suspect is or is not the author. In this case, examples of the writing of a single author are given and the task is to verify that a given target text was or was not written by this author. Hence, verification can be thought of as a one-class classification problem and it is significantly more difficult than the basic authorship identification problem <ref type="bibr" target="#b6">[5]</ref>.</p><p>In authorship profiling (also known as the authorship characterization problem) there is no candidate set at all; the challenge is to provide as much demographic or psychological information as possible about the author. Unlike the identification problem, authorship profiling does not begin with a set of writing samples from known candidate authors.
Instead, it exploits the sociolinguistic observation that different groups of people speaking or writing in a particular genre and in a particular language, use that language differently; that is, they vary in how often they use certain words or syntactic constructions in addition to variation in pronunciation or intonation <ref type="bibr">[6]</ref>. Profiling problem is concerned with determining such characteristics as gender, educational and cultural backgrounds, language familiarity and so on of the author that produced a piece of work. This is a harder problem than the identification problem since it characterizes the writing style of a set of writers rather than the unique style of a single person <ref type="bibr" target="#b8">[7]</ref>.</p><p>Despite variations in the complexities of authorship problems, choices of appropriate linguistic features and analytical techniques are paramount.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Authorship Attribution Methods</head><p>One of the main components of authorship attribution methods is the extraction of linguistic features that represent the writing style of an author or author group. Language, like genetics, can be characterized by a very large set of potential features that may or may not show up in any specific sample, and that may or may not have obvious large-scale impact. By identifying the features characteristic of a group or individual of interest, and then finding those features in an anonymous document, one can support a finding that the document was written by that person or a member of that group <ref type="bibr" target="#b9">[8]</ref>. The various feature sets, otherwise known as feature metrics in computational linguistics, can be classified into four main classes, which are the lexical, syntactic, content-specific and structural features <ref type="bibr" target="#b10">[9]</ref>. Researchers vary in their choices of linguistic features; while some used feature(s) that belong to a single class (for example, <ref type="bibr" target="#b11">[10]</ref>; <ref type="bibr" target="#b12">[11]</ref>; <ref type="bibr" target="#b13">[12]</ref>; and <ref type="bibr" target="#b10">[9]</ref>), others (such as <ref type="bibr">[6]</ref>; <ref type="bibr" target="#b2">[2]</ref>; <ref type="bibr" target="#b5">[4]</ref>; <ref type="bibr" target="#b4">[3]</ref>; <ref type="bibr" target="#b8">[7]</ref>; <ref type="bibr" target="#b1">[1]</ref>; <ref type="bibr" target="#b14">[13]</ref>; <ref type="bibr" target="#b15">[14]</ref>) used features across multiple feature classes.</p><p>The second component is the application of analytical techniques to feature sets for supervised or unsupervised learning. Different analytical techniques have been used in previous authorship attribution studies.
These techniques can be classified into three, namely, the unitary invariant, multivariate and machine learning approaches <ref type="bibr" target="#b9">[8]</ref>. Machine learning examines previous examples and their outcomes and learns how to reproduce these and make generalisations about new cases. Machine learning algorithms differ in the level of data they require and in their ability to resolve data ambiguities such as noise or missing data. Machine learning techniques include rule-based algorithms such as OneR, neural networks such as Multilayer Perceptron, statistical modelling algorithms such as Naive Bayes, decision trees such as J48, linear models such as linear regression and Support Vector Machine, and instance-based learning algorithms such as Nearest Neighbour.</p><p>Unlike in the choice of feature sets, researchers are less varied in their choices of analytical techniques. While older studies tend to favour the use of Principal Component Analysis, the more recent ones tend towards the use of Support Vector Machine. Most previous studies reported the use of only a single analytical technique. This is notable in light of the following statement by <ref type="bibr" target="#b16">[15]</ref>:</p><p>Experience shows that no single machine learning scheme is appropriate to all data mining problems. The universal learner is an idealistic fantasy. Real datasets vary and to obtain accurate models, the bias of the learning algorithm must match the structure of the domain. Data mining is an experimental science (p. 365).</p><p>The choice of machine learning scheme should therefore be based on the result of a prior experiment that validates its suitability to the dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Related Authorship Studies</head><p>A number of previous studies have shown the relative performances of a number of feature types and analytical techniques in authorship analyses. <ref type="bibr" target="#b4">[3]</ref> studied authorship identification with many authors and limited training data. Their result showed that systematically increasing the number of authors under investigation led to a significant decrease in performance. Their study also revealed that providing a more heterogeneous set of features improves the system significantly. <ref type="bibr" target="#b1">[1]</ref> investigated the types of writing-style features and classification techniques that were effective for identifying the authorship of online messages. They reported that the accuracy kept increasing as more types of features were used and that Support Vector Machine (SVM) outperformed Neural Networks (NN), which in turn outperformed the C4.5 classifier. The best accuracy was achieved when SVM and all feature types were used, but classifier performance reduced as the number of authors increased. <ref type="bibr" target="#b2">[2]</ref> demonstrated through experiment that adding stylistic idiosyncrasy features to letter n-grams, to function words and to a combination of n-grams and function words consistently led to improved accuracy in identifying the native language of the author of a given English language text. The studies of <ref type="bibr" target="#b4">[3]</ref> and <ref type="bibr" target="#b1">[1]</ref> are situated within the identification domain of authorship attribution problems because they started with a closed set of candidate authors, while that of <ref type="bibr" target="#b2">[2]</ref> was a profiling problem. However, the focus of <ref type="bibr" target="#b2">[2]</ref> was mainly to show the ability of idiosyncrasies to detect a writer's native language.
It therefore did not address some of the salient issues covered by <ref type="bibr" target="#b1">[1]</ref>, namely the relative performances of analytical techniques and the effect of increasing the number of candidate authors. Also, the corpus used by <ref type="bibr" target="#b2">[2]</ref> was the International Corpus of Learner English (ICLE), whose texts had between 579 and 846 words. These lengths are quite high for online messages, which are usually very short. The present study focuses on the shorter texts that characterise online messages. Therefore, the present study seeks to find the writing-style (linguistic) features, classification techniques and number of class options that can maximize the effectiveness of profiling the native language of the author of an online message.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">EXPERIMENTATION FOR OPTIMIZING AUTHORSHIP PROFILING OF ONLINE MESSAGES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem formulation</head><p>Given a number of online messages written in English by nationals of selected African countries, namely Cameroon, Ghana, Liberia, Nigeria and Sierra-Leone, the goal is to find the types of writing-style features, the classification technique and the number of class options that can maximize the effectiveness of profiling the linguistic origin of anonymous electronic texts written by the nationals of any of the selected countries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Research Method</head><p>A multistage sampling technique was used to select a representative sample of electronic texts from the population of texts contained in the relevant country pages of the website www.topix.com. To get texts that could be useful for the supervised learning approach of the study, each text was opened, read and assessed based on the number of words it contained and the sense of affiliation to the respective country depicted in its content. A comment was considered to be affiliated to (and labelled as being from) a particular country if it was found in that country's forum and if it contained such phrases as 'our country', 'our beloved country' and other related ones in its discourse. Initially the researcher targeted texts with a hundred or more words; however, this was reduced to texts with twenty (20) or more words because of the scarcity of large texts on the discussion forums. The numbers of texts selected for the study in November 2011, based on the assessment criteria, are shown in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Text Pre-processing and Processing</head><p>The corpora were subjected to pre-processing in order to put them in the format expected by the relevant software for text processing. The pre-processing tasks included deletion of e-mail headers, removal of control codes, text aggregation, and removal of non-ASCII characters. Text processing was achieved by extracting linguistic features from the sampled texts using computer code written by the researcher in the Python 2.6.4 programming language, based on the Natural Language Toolkit (NLTK) version 2.0. Some of the specific issues handled in the course of text processing were tokenization, part-of-speech tagging and linguistic feature extraction.</p><p>Although there is no agreement on a best set of features for a wide range of application domains, selected feature metrics must be reliably characteristic of the attribution domain <ref type="bibr" target="#b22">[21]</ref>. Certain features were extracted in the present study, based on their relevance as determined from relevant literature on authorship attribution and Nigerian Englishes (<ref type="bibr" target="#b17">[16]</ref>; <ref type="bibr" target="#b18">[17]</ref>). The extracted features were: syntactic features, comprising the twenty (20) most frequent function words in the topix.com corpus; idiosyncratic features, comprising frequency of occurrence of spelling errors, adverb-verb part-of-speech (POS) bigram distribution and article omission/inclusion distribution; structural features, comprising lexical diversity; and content-specific features, consisting of the twenty (20) most frequent noun, adjective, verb and adverb unigrams in the topix.com corpus. The features extracted and their denotations are shown in Table <ref type="table" target="#tab_1">2</ref>.</p></div>
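<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration only, two of the feature families described above (function-word frequencies and lexical diversity) can be sketched with the Python standard library. This is not the researcher's NLTK-based code: the regular-expression tokenizer, the short function-word list, the use of the type-token ratio as the lexical diversity measure, and all function names are assumptions made for this example.</p>
<p>
```python
import re
from collections import Counter

# Hypothetical candidate list; the study derived its function words
# from the topix.com corpus rather than from a fixed list like this.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "is", "that", "for", "it"}


def tokenize(text):
    """Lowercase word tokenizer (an assumption; the study used NLTK)."""
    return re.findall(r"[a-z']+", text.lower())


def top_function_words(texts, n=20):
    """Return the n most frequent function words across a corpus."""
    counts = Counter(
        tok for text in texts for tok in tokenize(text) if tok in FUNCTION_WORDS
    )
    return [word for word, _ in counts.most_common(n)]


def lexical_diversity(tokens):
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def feature_vector(text, vocabulary):
    """Relative frequency of each vocabulary word, then lexical diversity."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty texts
    return [counts[w] / total for w in vocabulary] + [lexical_diversity(tokens)]
```
</p><p>Frequencies are divided by text length so that messages of different lengths remain comparable, which matters for texts as short as twenty words.</p></div>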
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decision to extract the twenty most frequent features (function word, noun, adjective, verb and adverb unigrams) was based on a prior experiment, which showed that the sum of the frequencies of occurrence of the twenty most frequent features accounted for at least 60% of the cumulative frequency of all features extracted in each case.</p></div>
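<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coverage criterion described above can be checked with a few lines of Python. The sketch below is an assumption about how such a check might be written, not the prior experiment itself; the function name and the toy token list in the usage note are hypothetical.</p>
<p>
```python
from collections import Counter


def top_n_coverage(tokens, n=20):
    """Share of all token occurrences covered by the n most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / total
```
</p><p>For a toy list with six occurrences of one word, three of a second and one of a third, top_n_coverage(tokens, 1) gives 0.6. Under the criterion reported in the study, a value of at least 0.6 at n=20 would be expected for each feature category.</p></div>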
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Experimental Setup</head><p>i. Class Labelling: According to the study of <ref type="bibr" target="#b4">[3]</ref>, a learner's performance changes with the number of candidate authors. To find out the effect of varying the number of classes on classification performance in the present study, the dataset was copied into three different files in which all parameters were the same except the class labels. The class labels were controlled as presented in Table <ref type="table" target="#tab_2">3</ref>. The texts in Dataset 1 bear their original class labels, that is, the actual countries of affiliation of the writers as determined from the forums and the texts. There are therefore five different class labels, representing the five country sources of the texts. Dataset 2 has three class labels; texts from Nigeria and Ghana bear their original country source labels while those from the other three countries were combined and labelled 'Non-Ghana-Nigeria'. This was informed by a previous study that showed varying degrees of similarity in English language usage among the selected countries. Dataset 3 labelled texts from Nigeria as 'Nigeria' while texts from the other four countries were combined under the label 'Non-Nigeria'. This was done to achieve a two-class dataset option. Each of the three datasets (Dataset 1, Dataset 2 and Dataset 3), with each of the feature set types (F1, F2, F3, F4) and all their possible combinations (F1+F2, F1+F3, F1+F4, F2+F3, F2+F4, F3+F4, F1+F2+F3, F1+F2+F4, F1+F3+F4, F2+F3+F4, F1+F2+F3+F4), was analysed using the four machine learning algorithms (Naive Bayes, SMO, J48 and Multilayer Perceptron).</p><p>Ten-fold cross-validation was used to evaluate the models' performances based on percent correct (the percentage of all instances that are classified correctly) and the Kappa statistic (a measure of the agreement between predicted and observed categorization, corrected for agreement that happens by chance).</p></div>
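<div xmlns="http://www.tei-c.org/ns/1.0"><p>The labelling scheme and the feature set combinations described above can be sketched as follows. The experiments themselves were run in WEKA; the function names here are hypothetical, and the sketch only mirrors the relabelling rules and the enumeration of feature sets stated in the text.</p>
<p>
```python
from itertools import combinations

FEATURE_SETS = ("F1", "F2", "F3", "F4")


def relabel(country, kept):
    """Keep the label if it is in `kept`; otherwise merge into a 'Non-...' class."""
    if country in kept:
        return country
    return "Non-" + "-".join(kept)


def labelling_options(country):
    """Class label of one text under each of the three dataset options."""
    return {
        "Dataset 1": country,                                # five classes
        "Dataset 2": relabel(country, ("Ghana", "Nigeria")),  # three classes
        "Dataset 3": relabel(country, ("Nigeria",)),          # two classes
    }


def feature_combinations(names=FEATURE_SETS):
    """All non-empty subsets of the feature set types."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]
```
</p><p>Four feature set types yield 15 non-empty subsets (the four single types plus the eleven combinations listed above). Crossing them with three datasets and four classifiers gives 180 configurations by this count.</p></div>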
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Evaluation of the Experiments</head><p>The tables in Appendix 1 show the percent correct and Kappa statistic values derived for each of the datasets in our experiment. The results are presented successively for Naive Bayes, SMO, J48 and Multilayer Perceptron. It can be observed from the tables that the percent correct values appear to be highest for Dataset 3 while the Kappa statistics appear to be highest for Dataset 2. This observation cuts across virtually all feature sets and classifiers. This implies that the classifiers were better able to classify Dataset 3 correctly compared to the other datasets, while the classifications achieved on Dataset 2 gave better agreement between predicted and observed categorization after correcting for agreement that happened by chance. The result of SMO on Dataset 3 is noteworthy: although the percent correct values were relatively high, the Kappa statistics were all zero. This lack of coherence in the directions of the two performance measures led us to use the product of the two measures (percent correct and Kappa statistic) as a basis for comparing the models' performances. The decision to use the product was informed by the theory of Dimensional Analysis, a problem-solving method that uses the fact that any number or expression can be multiplied by one without changing its value. One can only meaningfully add or subtract quantities of the same type, but one can multiply or divide quantities of different types. When two measurements are multiplied together, the product is of a type that depends on the types of the measurements. This analysis is routinely applied in physics, and it is an engineering tool that is widely applied to numerous engineering problems for designing and testing all types of engineering and physical systems (<ref type="bibr" target="#b19">[18]</ref>; <ref type="bibr" target="#b20">[19]</ref>).
The products of the two measures are presented in Appendix 2. The table in Appendix 2 presents the performances of our models taking both performance measures into consideration. We consider this table more representative of the models' performances because it combines the strengths and weaknesses of the two performance measures. Answers to the research questions will, therefore, be based on the content of this table.</p></div>
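<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the combined measure concrete, the sketch below computes percent correct, Cohen's Kappa and their product from a confusion matrix (rows are actual classes, columns are predicted classes). The function names and the confusion matrices in the usage note are hypothetical and are not taken from the study's appendices; the values in the study were obtained from WEKA.</p>
<p>
```python
def percent_correct(matrix):
    """Percentage of instances on the diagonal of a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


def kappa(matrix):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present and predicted
    return (observed - expected) / (1.0 - expected)


def combined_score(matrix):
    """Product of percent correct and Kappa, as used to compare models."""
    return percent_correct(matrix) * kappa(matrix)
```
</p><p>A classifier that assigns every text to the majority class of an imbalanced two-class dataset, for example [[0, 20], [0, 80]], reaches 80 percent correct with a Kappa of zero, so its combined score is zero. This is the pattern described above for SMO on Dataset 3. A matrix such as [[15, 5], [10, 70]] gives 85 percent correct and a Kappa of about 0.57.</p></div>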
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">RESULTS AND DISCUSSION</head><p>Research Question 1: Which feature set type maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p><p>Figure <ref type="figure" target="#fig_1">1</ref> is derived from the table in Appendix 2; it shows the product of the percent correct and Kappa statistic values obtained for the feature set types in our experiment. The results are presented successively for Naive Bayes, SMO, J48 and Neural Network. Across all three datasets, the feature set that combined all feature types (F1+F2+F3+F4) performed best. This is followed by (F2+F4), (F2+F3+F4) and (F1+F2+F3), while F1 performed worst. Our result shows that the inclusion of features from all four types (lexical, syntactic, idiosyncrasies and content specific) produced the most effective model. This result is consistent with those of <ref type="bibr" target="#b21">[20]</ref>, <ref type="bibr" target="#b2">[2]</ref> and <ref type="bibr" target="#b1">[1]</ref> pg 365, who reported that combining feature types in their studies gave a better result. Using vocabulary richness alone produced the poorest result, probably because of the short length of the online messages in the study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</head><p>Figure <ref type="figure" target="#fig_2">2</ref> shows the relative performances of the four classifiers across all feature types (F1+F2+F3+F4) and datasets. Neural Network (Multilayer Perceptron) performed best when compared to the other three classifiers. Its performance was highest on the feature set (F1+F2+F3+F4) contained in our two-class option dataset (Dataset 3). Most previous studies considered SVM most appropriate in authorship attribution (though most times without carrying out a prior experiment). <ref type="bibr" target="#b1">[1]</ref>, however, reported that there were no significant performance differences between SVM and neural networks. It can be observed that the SVM implementation (SMO) outperformed the other three classifiers when the texts carried their natural class labels (Dataset 1) and performed worst on Dataset 3. This corroborates the submission of <ref type="bibr" target="#b16">[15]</ref> that no single machine learning scheme is appropriate to all data mining problems because real datasets vary and, to obtain accurate models, the bias of the learning algorithm must match the structure of the domain.</p><p>This means that the structure of our Dataset 3 is more amenable to the neural network than to any of the other machine learning schemes (Naive Bayes, SMO, J48) in our study. Also worthy of note is the usefulness of our application of the dimensional analysis principle, which informed the multiplication of the two performance measures in our study. For example, if our comparison had been based on percent correct (in Appendix 1) only, we might have erroneously rated the performance of SMO relatively high on Dataset 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</head><p>Figure <ref type="figure" target="#fig_4">3</ref> shows that the dataset having two class options (Dataset 3) performed best, followed by the one having three class options (Dataset 2) and lastly the one having the instances labelled naturally, with five classes (Dataset 1). The result is consistent with those of <ref type="bibr" target="#b4">[3]</ref> and <ref type="bibr" target="#b1">[1]</ref>, which reported that authorship attribution success improves with a reduction in the number of authors or author classes. Specifically, however, the present result shows that if an authorship profiling problem can be reduced to a two-class one, an appreciable improvement in the effectiveness of the profiling task can be obtained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?</head><p>The TrainTestSplitMaker component of WEKA's Knowledge Flow interface was used to evaluate the performance of our model in classifying electronic messages to writers' countries of affiliation. A separate two-class label file was created for each country, resulting in a dataset for each country in which all attributes except the class attribute were the same. The class attribute for a particular country had instances labelled either with the country name (such as Nigeria, Ghana, Cameroon) or with the corresponding 'non-country' name (such as Non-Nigeria, Non-Ghana, Non-Cameroon). Table <ref type="table" target="#tab_3">4</ref> shows the effectiveness of profiling authors' countries of affiliation by the resultant model. Application of our optimization method resulted in a remarkable improvement in the profiling of each country from the others. The study showed that we could achieve a percent correct ranging between 70.8% and 88.2% at Kappa statistics ranging between 0.04 and 0.34, compared to the highest possible percent correct value of 43.8% at a Kappa statistic of 0.26 if our method was not applied. This, however, comes at a cost to the efficiency of the profiling process because we needed to create separate labels for the class attribute. The extent of improvement in model performance can nevertheless be said to outweigh the additional effort. The detailed performance of the model is shown in Table <ref type="table" target="#tab_5">5</ref>. The resultant model performed well when we consider the weighted averages of the performance measures of each dataset. It can, however, be observed that the model was better at identifying texts that were not from the country than those that were from the country in each case.
It can also be observed that the performance of the model in predicting each country's texts varies directly with the number of that country's texts in the study corpus. The best performance was achieved in profiling Nigerian electronic texts from Non-Nigeria texts, followed by that of Sierra Leone and then Ghana. Thus, it can be deduced that the performance of our model could be much improved with bigger sub-corpora sizes.</p></div>
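<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked check on the size of the improvement in percent correct, the figures quoted above (43.8% without the method, 70.8% to 88.2% with it) can be converted into relative changes. The helper name is hypothetical, and the calculation covers percent correct only, not the Kappa statistic.</p>
<p>
```python
def relative_improvement(baseline, value):
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


BASELINE = 43.8          # best percent correct reported without the method
LOW, HIGH = 70.8, 88.2   # reported range of percent correct with the method

improvements = [round(relative_improvement(BASELINE, v), 1) for v in (LOW, HIGH)]
print(improvements)
```
</p><p>On these figures the relative gain in percent correct lies between roughly 62% and 101%, depending on the country being profiled.</p></div>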
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONCLUSION</head><p>The study sought, through experiments, the number of class options, feature set types and machine learning scheme that maximize the effectiveness of identifying the countries of affiliation of authors of online messages composed in English. The online messages in our corpus were collected from online forums of five African countries, with average lengths of 52 to 102 words. Using the product of percent correct and Kappa statistic as our basis for comparing models, the experiment showed that we achieved the most effective model when all feature set types, contained in a two-class dataset, were analysed with the neural network (Multilayer Perceptron) machine learning scheme. Application of the parameters of the most effective model (derived from the experiment) to profiling the countries of affiliation of authors of the online messages resulted in about a hundred percent improvement in effectiveness.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Experiments were carried out using the Experimenter interface of the open source Waikato Environment for Knowledge Analysis (WEKA) machine learning tool. In this study, four machine learning algorithm implementations in WEKA, namely Naïve Bayes, SMO (SVM implementation), J48 and Multilayer Perceptron (neural network implementation), were used. The experiment was carried out to compare the performances of classifier models in the face of: a. Changing the number of classes. b. Changing the linguistic feature sets. c. Changing classifier algorithms.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Comparison of feature sets performances</figDesc><graphic coords="4,344.30,274.70,184.30,118.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Relative performances of the four classifiers across all feature and data sets.</figDesc><graphic coords="4,343.55,589.55,185.90,105.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Question 3 :</head><label>3</label><figDesc>Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages? Fig. 3 shows the percent correct values derived for each of the datasets in our experiment, using only the most precise classification scheme (Neural Network) and all feature sets (F1+F2+F3+F4). The results are presented successively for Naive Bayes, SMO, J48 and Neural Network.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Column Chart of Classifier Performances with Varied Class Labelling Options</figDesc><graphic coords="5,84.95,416.15,181.10,102.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 : Training Data Set</head><label>1</label><figDesc></figDesc><table><row><cell>Country's forum website</cell><cell>No. of pages</cell><cell>Pages selected</cell><cell>No. of selected texts</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 : Extracted Linguistic Features</head><label>2</label><figDesc></figDesc><table><row><cell>Feature type</cell><cell>Feature metric</cell><cell>Denotation</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 : Dataset Class Labelling Options</head><label>3</label><figDesc></figDesc><table><row><cell>File Name</cell><cell>No of Class Labels</cell><cell>Class Labels</cell><cell>Remark</cell></row><row><cell>Dataset 1</cell><cell>5</cell><cell>Nigeria, Ghana, Cameroon, Liberia, Sierra-Leone</cell><cell>Labelling according to texts' original classes.</cell></row><row><cell>Dataset 2</cell><cell>3</cell><cell>Nigeria, Ghana, Non-Ghana-Nigeria</cell><cell>Labelling informed by language similarities between the selected countries as found in a previous study [21].</cell></row><row><cell>Dataset 3</cell><cell>2</cell><cell>Nigeria, Non-Nigeria</cell><cell>Testing a 2-class labelling scheme which can enable the identification of online texts from a country from those of other countries put together.</cell></row></table></figure>
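<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three class labelling options of Table 3 amount to a relabelling of the original five country classes. A minimal Python sketch of that mapping follows; the function name and the scheme keys are hypothetical, and only the class labels themselves come from the table.</p>
<p><code lang="python"><![CDATA[
```python
# Class labels from Table 3; the function and scheme names are hypothetical.
COUNTRIES = ["Nigeria", "Ghana", "Cameroon", "Liberia", "Sierra-Leone"]


def relabel(country, scheme):
    """Map an original country label to the label used by a labelling option."""
    if country not in COUNTRIES:
        raise ValueError("unknown country: " + country)
    if scheme == "dataset1":    # five classes: keep the original label
        return country
    if scheme == "dataset2":    # three classes: Nigeria, Ghana, everything else
        return country if country in ("Nigeria", "Ghana") else "Non-Ghana-Nigeria"
    if scheme == "dataset3":    # two classes: Nigeria against the rest
        return "Nigeria" if country == "Nigeria" else "Non-Nigeria"
    raise ValueError("unknown scheme: " + scheme)


class_counts = {}
for scheme in ("dataset1", "dataset2", "dataset3"):
    labels = sorted({relabel(c, scheme) for c in COUNTRIES})
    class_counts[scheme] = len(labels)
    print(scheme, len(labels), labels)
```
]]></code></p>
<p>The same function could be applied to the class attribute of every text before training, so that one corpus yields all three datasets.</p></div>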
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 : Effectiveness of Profiling Authors' Countries of Affiliation</head><label>4</label><figDesc></figDesc><table><row><cell>Country</cell><cell>Percent Correct</cell><cell>Kappa Statistics</cell><cell>PC*KS</cell></row><row><cell>Nigeria</cell><cell>75.80</cell><cell>0.34</cell><cell>25.95</cell></row><row><cell>Cameroon</cell><cell>73.80</cell><cell>0.10</cell><cell>7.68</cell></row><row><cell>Ghana</cell><cell>78.40</cell><cell>0.27</cell><cell>21.54</cell></row><row><cell>Liberia</cell><cell>88.20</cell><cell>0.04</cell><cell>3.23</cell></row><row><cell>Sierra Leone</cell><cell>70.80</cell><cell>0.28</cell><cell>19.59</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>PC*KS denotes Percent Correct * Kappa Statistics</head><label></label><figDesc></figDesc><table /></figure>
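<div xmlns="http://www.tei-c.org/ns/1.0"><p>The PC*KS column of Table 4 can be checked against the other two columns with a short Python sketch. The reported products differ slightly from the products of the printed values; the sketch assumes, without confirmation from the source, that this is because the products were computed before the kappa values were rounded to two decimals. The helper name and the tolerance are our own.</p>
<p><code lang="python"><![CDATA[
```python
# (percent correct, kappa, reported PC*KS) copied from Table 4.
TABLE4 = {
    "Nigeria": (75.80, 0.34, 25.95),
    "Cameroon": (73.80, 0.10, 7.68),
    "Ghana": (78.40, 0.27, 21.54),
    "Liberia": (88.20, 0.04, 3.23),
    "Sierra Leone": (70.80, 0.28, 19.59),
}


def rounding_slack(pc, ks):
    """Largest gap expected if pc, ks and the product were each rounded to 2 decimals."""
    return (pc + 0.005) * (ks + 0.005) - pc * ks + 0.005


# Product of the printed (rounded) columns for each country.
recomputed = {c: pc * ks for c, (pc, ks, _) in TABLE4.items()}

# True where the reported product is explainable by rounding alone.
consistent = {
    c: abs(reported - recomputed[c]) <= rounding_slack(pc, ks)
    for c, (pc, ks, reported) in TABLE4.items()
}

# Countries ordered from most to least effectively profiled, by reported PC*KS.
ranking = sorted(TABLE4, key=lambda c: TABLE4[c][2], reverse=True)
print(recomputed, consistent, ranking)
```
]]></code></p>
<p>Under this assumption every row is consistent with rounding, and the ordering by PC*KS differs from the ordering by percent correct alone: Liberia has the highest percent correct but the lowest PC*KS.</p></div>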
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5 : Detailed Prediction Performance of the Resultant Model</head><label>5</label><figDesc></figDesc><table><row><cell>TP Rate</cell><cell>FP Rate</cell><cell>Precision</cell><cell>Recall</cell><cell>F-score</cell><cell>ROC Area</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0">CoRI'16, Sept 7-9, 2016, Ibadan, Nigeria.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The study achieved greater effectiveness, but with a trade-off in efficiency. A model that maximizes both effectiveness and efficiency in profiling the authorship of online messages remains a subject for further study. In its present state, the approach is well suited to cases where a particular group is suspected and the purpose of authorship attribution is to confirm the suspect's group of affiliation.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A framework for authorship identification of online messages: writing-style features and classification techniques</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="378" to="393" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automatically determining an anonymous author&apos;s native language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zigdon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ISI</title>
		<title level="s">Lecture Notes in Computer Science (LNCS)</title>
		<editor>
			<persName><forename type="first">P</forename><forename type="middle">B</forename><surname>Kantor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">G</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Roberts</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Zeng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">3495</biblScope>
			<biblScope unit="page" from="209" to="217" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Authorship attribution and verification with many authors and limited data</title>
		<author>
			<persName><forename type="first">K</forename><surname>Luyckx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Daelemans</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd International Conference on Computational Linguistics</title>
				<meeting>the 22nd International Conference on Computational Linguistics<address><addrLine>Manchester</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008-08">18-22 August 2008</date>
			<biblScope unit="page" from="513" to="520" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Authorship attribution with thousands of candidate authors</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Messeri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th annual international ACM SIGIR (Special Interest Group on Information Retrieval) conference on research and development in information retrieval</title>
				<meeting>the 29th annual international ACM SIGIR (Special Interest Group on Information Retrieval) conference on research and development in information retrieval<address><addrLine>Seattle, Washington, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006-08-06">Aug. 6-11, 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Computational methods in authorship attribution</title>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science and Technology</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="9" to="26" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Automatically Profiling the Author of an Anonymous Text</title>
		<author>
			<persName><forename type="first">S</forename><surname>Argamon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Koppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Schler</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">52</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="119" to="123" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Mining E-mail Content for Author Identification Forensics</title>
		<author>
			<persName><forename type="first">O</forename><surname>De Vel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Corney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Mohay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Special Interest Group on Management of Data</title>
				<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="55" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Future trends in authorship attribution</title>
		<author>
			<persName><forename type="first">P</forename><surname>Juola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Federation for Information Processing</title>
		<imprint>
			<biblScope unit="volume">242</biblScope>
			<biblScope unit="page" from="119" to="132" />
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A novel approach of mining write-prints for authorship attribution in e-mail forensics</title>
		<author>
			<persName><forename type="first">F</forename><surname>Iqbal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Hadjidj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">C</forename><surname>Fung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Debbabi</surname></persName>
		</author>
		<ptr target="www.elsevier.com/locate/diin.2008.05.001" />
	</analytic>
	<monogr>
		<title level="j">Digital Forensic Research Workshop</title>
		<imprint>
			<date type="published" when="2008">2008</date>
			<publisher>Elsevier Ltd</publisher>
		</imprint>
	</monogr>
	<note>Nov. 16, 2009</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Stylometry and the civil war: the case of the Pickett letters</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">I</forename><surname>Holmes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CHANCE</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="18" to="25" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N G</forename><surname>Binongo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CHANCE</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="9" to="17" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The application of principal component analysis to stylometry</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N G</forename><surname>Binongo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Literary and Linguistic Computing</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="445" to="466" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Visualizing authorship for identification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abbasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Lecture Notes in Computer Science</title>
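		<editor>
			<persName><forename type="first">S</forename><surname>Mehrotra</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Zeng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Thuraisingham</surname></persName>
		</editor>
		<editor>
			<persName><surname>Wang</surname></persName>
		</editor>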
		<imprint>
			<biblScope unit="page" from="60" to="71" />
			<date type="published" when="2006">2006</date>
			<publisher>Springer-Verlag</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Writeprints: a stylometric approach to identity level identification and similarity detection in cyberspace</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abbasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<idno type="DOI">10.1145/1344411.1344413</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Frank</surname></persName>
		</author>
		<title level="m">Data mining: practical machine learning tools and techniques</title>
		<imprint>
			<publisher>Morgan Kaufmann Publishers</publisher>
			<pubPlace>USA</pubPlace>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
	<note>2nd ed.</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Kujore</surname></persName>
		</author>
		<title level="m">English usage: some notable Nigerian variations</title>
				<imprint>
			<publisher>Evans Brothers Nigeria Publishers Limited</publisher>
			<date type="published" when="1985">1985</date>
			<biblScope unit="page" from="1" to="112" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Jowitt</surname></persName>
		</author>
		<title level="m">Nigerian English usage: An Introduction</title>
				<imprint>
			<publisher>Longman</publisher>
			<date type="published" when="1991">1991</date>
			<biblScope unit="page" from="1" to="277" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Application of Dimensional Analysis in Systems Modeling and Control Design</title>
		<author>
			<persName><forename type="first">P</forename><surname>Balaguer</surname></persName>
		</author>
		<imprint>
			<publisher>The Institution of Engineering and Technology</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<title level="m" type="main">Applied Dimensional Analysis and Modeling</title>
		<author>
			<persName><forename type="first">T</forename><surname>Szirtes</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
			<publisher>Elsevier/Butterworth-Heinemann</publisher>
			<pubPlace>Amsterdam</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">A Cybercrime Forensic Method for Chinese Web Information Authorship Analysis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Teng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PAISI 2009</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</editor>
		<meeting><address><addrLine>Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">5477</biblScope>
			<biblScope unit="page" from="14" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Comparative Analysis of Idiosyncrasy, Content and Function Word Distributions in the English Language Variants of Selected African Countries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Opesade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Adegbola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tiamiyu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computational Linguistics Research</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="130" to="143" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
