<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimizing Authorship Profiling of Online Messages</article-title>
      </title-group>
      <contrib-group>
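        <contrib contrib-type="author">
          <string-name>Adeola O. Opesade</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Africa Regional Centre for Information Science, University of Ibadan</institution>
          ,
          <country country="NG">Nigeria</country>
        </aff>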
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>Authorship profiling is of growing importance in the current information age, partly because of its application in digital forensics. Profiling methodologies, like those of any other authorship analysis, consist mainly of feature extraction and the application of analytical techniques. The choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis; hence the need for methods that can help improve the success of authorship profiling undertakings. The present study sought, through experiments, the writing features, analytical technique and number of class labels that can help improve the effectiveness of profiling the country of affiliation of authors of online messages. The experiment showed that the most effective model was achieved when all feature set types in our study were used within a two-class dataset that was analysed with the Neural Network (Multilayer Perceptron) machine learning scheme. The study recommends further studies to find models that can maximize both effectiveness and efficiency in profiling the authorship of online messages.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship profiling</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Computational linguistics</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Nigerian English</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Electronic messages are extensively used to distribute information
over such channels as e-mail, Internet newsgroups, Internet chat
rooms, Internet forums and other user-generated content on the
Web. These messages differ from other forms of writing,
particularly because of their brevity. Unfortunately,
unethical users and criminals exploit the convenience of these
media to pursue their harmful goals. Digital forensics requires
the use of scientifically derived and proven methods for the
preservation, collection, validation, identification, analysis,
interpretation, documentation and presentation of digital evidence
derived from digital sources for litigation purposes.</p>
      <p>
        Authorship profiling is one of the major classes of authorship
attribution problems. It seeks the demographic or psychological
group of the author of an anonymous text. Its application in
forensics and digital security has made it increasingly
important in the present information age. Profiling methodologies,
like those of any other authorship analysis, consist mainly of
feature extraction and the application of analytical techniques. The choice
of feature sets and analytical techniques may significantly affect
the performance of authorship analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; thus, studies into the
optimization of authorship profiling of online messages can help
to improve the success of identifying sources of security threats
perpetrated through web-based channels.
      </p>
      <p>
        A number of previous studies ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; [22]; [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) have investigated
some parameters that could affect the effectiveness of authorship
attribution undertakings. These studies, however, focused on the
authorship identification problem and not on authorship profiling.
Considering the potential of authorship profiling in investigating
transnational digital breaches, the present study seeks to find
through experiments the writing-style features, classification
techniques and number of class options that can
maximize the effectiveness of profiling the authorship of electronic
messages. The following research questions were pursued in order
to achieve the purpose of the study:
      </p>
      <list list-type="simple">
        <list-item>
          <p>Research Question 1: Which feature set type maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p>
        </list-item>
        <list-item>
          <p>Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p>
        </list-item>
        <list-item>
          <p>Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</p>
        </list-item>
        <list-item>
          <p>Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>2. LITERATURE REVIEW</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Authorship Attribution Problems</title>
      <p>
        Authorship attribution is a process of examining the characteristics
of a piece of writing in order to draw conclusions about its author.
Authorship attribution problems vary in complexity. They have
been categorized into three major classes, namely, authorship
identification, authorship profiling and authorship verification. The
most straightforward version of these three is the identification
problem which involves the determination of the actual author of a
given text among a small set of candidate authors. Given a set of
writings of a number of authors, the task in authorship
identification is to assign a new piece of writing to one of them [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
In authorship verification, there is no closed candidate set but there
is one suspect and the challenge is to determine if the suspect is or
is not the author. In this case, examples of the writing of a single
author are given and the task is to verify that a given target text
was or was not written by this author. Hence, verification can be
thought of as a one-class classification problem, and it is
significantly more difficult than the basic authorship identification
problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In authorship profiling (also known as authorship characterization
problem) there is no candidate set at all; the challenge is to provide
as much demographic or psychological information as possible
about the author. Unlike the identification problem, authorship
profiling does not begin with a set of writing samples from known
candidate authors. Instead, it exploits the sociolinguistic
observation that different groups of people speaking or writing in a
particular genre and in a particular language, use that language
differently; that is, they vary in how often they use certain words
or syntactic constructions in addition to variation in pronunciation
or intonation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The profiling problem is concerned with determining
such characteristics as the gender, educational and cultural
backgrounds, language familiarity and so on of the author who
produced a piece of work. This is a harder problem than the
identification problem since it characterizes the writing style of a
set of writers rather than the unique style of a single person [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Despite variations in the complexities of authorship problems,
choices of appropriate linguistic features and analytical techniques
are paramount.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Authorship Attribution Methods</title>
      <p>
        One of the main components of authorship attribution methods is
the extraction of linguistic features that represent the writing style
of an author or author group. Language, like genetics, can be
characterized by a very large set of potential features that may or
may not show up in any specific sample, and that may or may not
have obvious large-scale impact. By identifying the features
characteristic of a group or individual of interest, and then finding
those features in an anonymous document, one can support a
finding that the document was written by that person or a member
of that group [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The various feature sets, otherwise known as
feature metrics in computational linguistics, can be classified into
four main classes: lexical, syntactic,
content-specific and structural features [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Researchers vary in their
choices of linguistic features; while some used feature(s) that
belong to a single class (for example, [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]; [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]; [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]; and [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]),
others (such as [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) used features
across multiple feature classes.
      </p>
      <p>
        The second component is the application of analytical techniques
to feature sets for supervised or unsupervised learning. Different
analytical techniques have been used in previous authorship
attribution studies. These techniques can be classified into three,
namely, the unitary invariant, multivariate and machine learning
approaches [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Machine learning examines previous examples and
their outcomes and learns how to reproduce these and make
generalisations about new cases. Machine learning algorithms
differ in the level of data they handle and in their ability to resolve data
ambiguities such as noise or missing data. Machine learning
techniques include rule-based algorithms such as OneR, neural
networks such as Multilayer Perceptron, statistical modelling
algorithms such as Naive Bayes, decision trees such as J48, linear
models such as linear regression and Support Vector Machine, and
instance-based learning algorithms such as Nearest Neighbour.
      </p>
      <p>
        Unlike in the choice of feature sets, researchers are less varied in
their choices of analytical techniques. While older studies tend to
favour Principal Component Analysis, the more recent
ones tend towards Support Vector Machine. Most
previous studies reported the use of only a single analytical
technique. This practice sits uneasily with the observation of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
      </p>
      <disp-quote>
        <p>Experience shows that no single machine
learning scheme is appropriate to all data
mining problems. The universal learner is
an idealistic fantasy. Real datasets vary and
to obtain accurate models, the bias of the
learning algorithm must match the structure
of the domain. Data mining is an
experimental science (p. 365).</p>
      </disp-quote>
      <p>The choice of machine learning scheme should therefore be based on the result
of a prior experiment that validates its suitability for the dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Related Authorship Studies</title>
      <p>
        A number of previous studies have shown relative performances of
a number of feature types and analytical techniques in authorship
analyses. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] studied authorship identification with
many authors and limited training data. Their results showed
that systematically increasing the number of authors under
investigation led to a significant decrease in performance. Their
study also revealed that providing a more heterogeneous set of
features improves the system significantly. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] investigated the
types of writing-style features and classification techniques that
were effective for identifying the authorship of online messages.
They reported that the accuracy kept increasing as more types of
features were used and that Support Vector Machine (SVM)
outperformed Neural Networks (NN), which in turn outperformed
the C4.5 classifier. The best accuracy was achieved when SVM
and all feature types were used but classifier performance reduced
as the number of authors increased. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] demonstrated through experiment
that adding stylistic idiosyncrasy features to
letter n-grams, to function words and to a combination of n-grams
and function words consistently led to improved accuracy in
identifying the native language of the author of a given English
language text.
      </p>
      <p>
        The studies of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are situated within the identification
domain of authorship attribution problems because they started
with a close number of candidate authors, while that of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was a
profiling problem. However, their focus was majorly to show the
ability of idiosyncrasies in detecting writer's native language. It
therefore, did not address some of the salient issues covered by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
which are relative performances of analytical techniques and effect
of increasing the number of candidate authors. Also, the corpus
used by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was the International Corpus of Learner English
(ICLE) which had between 579 and 846 words. These numbers
were quite high for an online message, which are usually very
short. The present study focuses on shorter texts which characterise
online messages. Therefore, the present study seeks to find the
writing-style (linguistic) features, classification techniques as well
as possible number of class options that can maximize the
effectiveness of profiling the native language of the author of an
online message.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTATION FOR OPTIMIZING AUTHORSHIP PROFILING OF ONLINE MESSAGES</title>
    </sec>
    <sec id="sec-9">
      <title>3.1 Problem formulation</title>
      <p>Given a number of online messages written in English by
nationals of selected African countries, namely Cameroon, Ghana,
Liberia, Nigeria and Sierra Leone, the goal is to find the types of
writing-style features, the classification technique and the
number of class options that can maximize the
effectiveness of profiling the linguistic origin of anonymous
electronic texts written by nationals of any of the selected
countries.</p>
    </sec>
    <sec id="sec-10">
      <title>3.2 Research Method</title>
      <p>A multistage sampling technique was used to select a
representative sample of electronic texts from the population of
texts contained in the relevant country pages of the website
www.topix.com. To obtain texts that could be useful for the
supervised learning approach of the study, each text was opened,
read and assessed based on the number of words it contained and the
sense of affiliation to the respective country depicted in its
content. A comment was considered to be affiliated to (and
labelled as being from) a particular country if it was found in that
country's forum and if it contained such phrases as 'our country',
'our beloved country' and other related ones in its discourse.
Initially the researcher targeted texts with a hundred or
more words; however, this was reduced to texts with twenty (20)
or more words because of the scarcity of long texts on the
discussion forums. The numbers of texts selected for the study in
November 2011, based on the assessment criteria, are shown
in Table 1.</p>
      <p><bold>3.2.1 Text Pre-processing and Processing</bold></p>
      <p>The corpora were pre-processed to put them
in the format expected by the text processing software.
The pre-processing tasks included deletion of e-mail headers,
removal of control codes, text aggregation, and removal of
non-ASCII characters. Text processing was achieved by extracting
linguistic features from the sampled texts using code
written by the researcher in the Python 2.6.4 programming language,
based on the Natural Language Toolkit (NLTK) version 2.0. Specific
issues handled in the course of text processing included
tokenization, part-of-speech tagging and linguistic feature
extraction.</p>
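      <p>The pre-processing and tokenization steps described above can be sketched as follows. This is a minimal illustration only and not the researcher's original code: it is written for Python 3 with the standard library rather than Python 2.6.4 with NLTK, and the header pattern, the token pattern and the function names (preprocess, tokenize, long_enough) are assumptions introduced for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed shape of a header line, e.g. "From: someone" or "Subject: hello".
HEADER_LINE = re.compile(r"^[A-Za-z-]+:\s")
# Assumed token pattern: alphabetic words, optionally with an apostrophe part.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def preprocess(raw):
    """Drop leading header lines, non-ASCII characters and control codes."""
    lines = raw.splitlines()
    # Remove a leading block of header-like lines, if there is one.
    while lines and HEADER_LINE.match(lines[0]):
        lines.pop(0)
    # Aggregate the remaining lines into a single text.
    text = " ".join(lines)
    # Remove non-ASCII characters.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace control codes with spaces, then collapse repeated whitespace.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split a cleaned text into lower-cased word tokens."""
    return [token.lower() for token in TOKEN.findall(text)]


def long_enough(text, min_words=20):
    """Apply the study's minimum length criterion of twenty words."""
    return len(tokenize(text)) >= min_words
```
]]></preformat>
      <p>In this sketch a message would be kept for feature extraction only if long_enough(preprocess(message)) is true. Part-of-speech tagging is not shown because it depends on the tagger supplied by NLTK.</p>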
      <p>
        Although there is no agreement on a best set of features for a wide
range of application domains, selected feature metrics must be
reliably characteristic of the attribution domain [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Certain features
were extracted in the present study, based on their relevance as
determined from relevant literature on authorship attribution and
Nigerian Englishes ([
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). The extracted features were: syntactic
features, comprising the twenty (20) most frequent function words
in the topix.com corpus; idiosyncratic features, comprising the
frequency of occurrence of spelling errors, the adverb-verb part of
speech (POS) bigram distribution and the article omission/inclusion
distribution; lexical features, comprising lexical diversity (vocabulary richness); and
content-specific features, consisting of the twenty (20) most frequent
noun, adjective, verb and adverb unigrams in the topix.com corpus.
The features extracted and their denotations are shown in Table
2.
      </p>
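      <p>Two of the feature metrics named above, the relative frequencies of the most frequent words and lexical diversity, can be illustrated with a short sketch. It assumes that a list of tokens is already available; the function names are hypothetical, lexical diversity is taken here to be the type-token ratio (the study does not state its exact formula), and selecting function words or POS-specific unigrams would require a word list or a tagger that is not shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def top_k_with_coverage(tokens, k=20):
    """Return the k most frequent tokens and the share of all
    token occurrences that they account for."""
    if not tokens:
        return [], 0.0
    top = Counter(tokens).most_common(k)
    covered = sum(count for _, count in top)
    return [word for word, _ in top], covered / len(tokens)


def relative_frequencies(tokens, vocabulary):
    """Probability of occurrence of each vocabulary item in one text."""
    if not tokens:
        return [0.0 for _ in vocabulary]
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def lexical_diversity(tokens):
    """Type-token ratio: distinct tokens divided by all tokens (assumed metric)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)
```
]]></preformat>
      <p>The vocabulary passed to relative_frequencies would be fixed once over the whole corpus, so that every message is described by the same ordered list of attributes.</p>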
      <sec id="sec-10-1">
        <sec id="sec-10-1-6">
          <table-wrap id="tbl-2">
            <label>Table 2</label>
            <caption>
              <title>Extracted Linguistic Features</title>
            </caption>
            <table>
              <thead>
                <tr><th>Feature type</th><th>Feature metric</th><th>Denotation</th></tr>
              </thead>
              <tbody>
                <tr><td>Lexical</td><td>Vocabulary richness</td><td>F1</td></tr>
                <tr><td>Syntactic</td><td>Probabilities of occurrence of most occurring function words</td><td>F2</td></tr>
                <tr><td>Idiosyncrasies</td><td>Probabilities of occurrence of article deletion, verb-adverb sequence and spelling errors</td><td>F3</td></tr>
                <tr><td>Content specific</td><td>Noun unigrams, adjective unigrams, verb unigrams, adverb unigrams</td><td>F4</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>
The decision to extract the twenty most frequent features (function
word, noun, adjective, verb and adverb unigrams) resulted from
a prior experiment which showed that the summed
frequencies of occurrence of the twenty most frequent features
accounted for at least 60% of the cumulative frequency of all
features extracted in each case.
          </p>
          <p><bold>3.3 Experimental Setup</bold></p>
          <p>
i. Class Labelling: According to the study of [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], a learner's
performance changes with the number of candidate authors. To find
out the effect of varying the number of classes on
classification performance in the present study, the dataset was
copied into three files that were identical in every respect
except the class labels. The class labels were controlled as
presented in Table 3.
          </p>
        </sec>
        <sec id="sec-10-1-7">
          <title>Dataset1</title>
        </sec>
        <sec id="sec-10-1-8">
          <title>Dataset 2</title>
          <p>The texts in Dataset 1 bear their original class labels, that is, the
actual countries of affiliation of the writers as determined from the
forums and the texts. There are therefore five different class labels,
representing the five country sources of the texts. Dataset 2 has
three class labels; texts from Nigeria and Ghana bear their original
country source labels while those from the other three countries
were combined and labelled 'Non-Ghana-Nigeria'. This was
informed by a previous study that showed varying degrees of
similarity in English language usage among the selected
countries. Dataset 3 labelled texts from Nigeria as 'Nigeria' while
texts from the other four countries were combined under the label
'Non-Nigeria'. This was done to achieve a two-class dataset option.</p>
          <p>Experiments were carried out using the Experimenter interface of
the open source Waikato Environment for Knowledge Analysis
(WEKA) machine learning tool. Four machine
learning algorithm implementations in WEKA were used, namely Naive
Bayes, SMO (an SVM implementation), J48 and Multilayer
Perceptron (a neural network implementation). The
experiment compared the performances of classifier
models in the face of:</p>
          <list list-type="alpha-lower">
            <list-item>
              <p>changing the number of classes;</p>
            </list-item>
            <list-item>
              <p>changing the linguistic feature sets;</p>
            </list-item>
            <list-item>
              <p>changing classifier algorithms.</p>
            </list-item>
          </list>
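          <p>The three class labelling options can be expressed as simple mappings from a writer's country to a class label. The sketch below is illustrative only; the function names are hypothetical, while the country names and the labels 'Non-Ghana-Nigeria' and 'Non-Nigeria' are those given in the text.</p>
          <preformat preformat-type="code"><![CDATA[
```python
COUNTRIES = ["Cameroon", "Ghana", "Liberia", "Nigeria", "Sierra Leone"]


def relabel(country, kept, other_label):
    """Keep the label of countries in `kept`; merge all others."""
    return country if country in kept else other_label


def dataset1_label(country):
    """Five classes: every text keeps its original country label."""
    return country


def dataset2_label(country):
    """Three classes: Nigeria, Ghana and the remaining countries merged."""
    return relabel(country, {"Nigeria", "Ghana"}, "Non-Ghana-Nigeria")


def dataset3_label(country):
    """Two classes: Nigeria against all other countries."""
    return relabel(country, {"Nigeria"}, "Non-Nigeria")
```
]]></preformat>
          <p>Because only the class attribute changes, the feature values of every instance are the same in all three datasets.</p>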
          <p>Each of the three datasets (Dataset 1, Dataset 2 and Dataset 3), with
each of the feature set types (F1, F2, F3, F4) and all their possible
combinations (F1+F2, F1+F3, F1+F4, F2+F3, F2+F4, F3+F4,
F1+F2+F3, F1+F2+F4, F1+F3+F4, F2+F3+F4, F1+F2+F3+F4),
was analysed using the four machine learning algorithms.
Ten-fold cross-validation was used to evaluate the models'
performances based on percent correct (the percentage of all instances
that are classified correctly) and the Kappa statistic (a measure of the
agreement between predicted and observed categorization,
corrected for agreement that happens by chance).</p>
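          <p>The experimental grid described above can be enumerated programmatically. The following sketch is an illustration, not part of the original WEKA setup: the classifier entries are plain labels rather than WEKA class names, and the resulting counts (15 feature sets and 180 runs) are derived here from the text rather than reported by the study.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from itertools import combinations, product

FEATURE_TYPES = ["F1", "F2", "F3", "F4"]
DATASETS = ["Dataset 1", "Dataset 2", "Dataset 3"]
# Labels only; these strings are not WEKA class paths.
CLASSIFIERS = ["Naive Bayes", "SMO", "J48", "Multilayer Perceptron"]


def feature_set_combinations(types):
    """All non-empty combinations of the feature types, e.g. 'F1+F3'."""
    return [
        "+".join(combo)
        for size in range(1, len(types) + 1)
        for combo in combinations(types, size)
    ]


FEATURE_SETS = feature_set_combinations(FEATURE_TYPES)
# One run per (dataset, feature set, classifier) triple.
RUNS = list(product(DATASETS, FEATURE_SETS, CLASSIFIERS))
```
]]></preformat>
          <p>Four single feature types plus eleven combinations give fifteen feature sets, so the full comparison comprises 3 x 15 x 4 = 180 cross-validated models.</p>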
        </sec>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>3.4 Evaluation of the Experiments</title>
      <p>The tables in Appendix 1 show the percent correct and Kappa statistic
values derived for each of the datasets in our experiment. The
results are presented successively for Naive Bayes, SMO, J48 and
Multilayer Perceptron. It could be observed from the tables that the
percent correct values appear to be highest for Dataset 3 while
Kappa statistics appear to be highest for Dataset 2. This
observation cuts across virtually all feature sets and classifiers.
This implies that classifiers were better able to classify Dataset 3
correctly than the other datasets, while classifications achieved
in Dataset 2 gave better agreement between predicted and observed
categorization after correcting for agreement that happened by
chance. Noteworthy is the result of SMO on Dataset 3:
although the percent correct values were relatively high, the Kappa
statistics were all zero. This lack of coherence in the directions of the
two performance measures led us to use the product of the two
measures (percent correct and Kappa statistic) as the basis for
comparing models' performances.</p>
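      <p>The divergence between the two measures can be reproduced with a small worked example. The sketch below computes percent correct and Cohen's Kappa from a confusion matrix; the matrices in the usage note are hypothetical and are not taken from Appendix 1. It shows how a classifier that assigns every instance to the majority class obtains a high percent correct together with a Kappa of zero.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def percent_correct(matrix):
    """Percentage of instances on the diagonal of a confusion matrix
    (rows are actual classes, columns are predicted classes)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


def kappa(matrix):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (total * total)
    if expected == 1.0:
        # Degenerate case: only one class is present and predicted.
        return 0.0
    return (observed - expected) / (1.0 - expected)
```
]]></preformat>
      <p>For a hypothetical two-class matrix [[0, 30], [0, 70]], in which all 100 instances are predicted as the second class, percent correct is 70 but Kappa is 0, because the observed agreement equals the agreement expected by chance. For [[20, 10], [10, 60]] percent correct is 80 and Kappa is about 0.52.</p>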
      <p>
        The decision to use the product was informed by the theory of
Dimensional Analysis, a problem-solving method that uses
the fact that any number or expression can be multiplied by one
without changing its value. One can only meaningfully add or
subtract quantities of the same type, but one can multiply or divide
quantities of different types. When two measurements are
multiplied together, the product is of a type that depends on the types
of the measurements. This analysis is routinely applied in physics,
and it is an engineering tool that is widely applied to
the design and testing of all types of
engineering and physical systems ([
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]; [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]). The
products of the two measures are presented in Appendix 2. The
table in Appendix 2 presents the performances of our models
taking both performance measures into consideration. We
consider this table more representative of the models'
performances because it combines the strengths and weaknesses of
the two performance measures. Answers to the research questions are,
therefore, based on the content of this table.
      </p>
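      <p>The combined measure is simply the product of percent correct and the Kappa statistic for each model. A minimal sketch of ranking models by this product follows; the model names and values are hypothetical and do not come from Appendix 2.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def combined_score(percent_correct, kappa):
    """Product of the two performance measures used to compare models."""
    return percent_correct * kappa


def rank_models(results):
    """Sort models from best to worst by their combined score.

    `results` maps a model name to a (percent correct, kappa) pair."""
    scored = {name: combined_score(pc, k) for name, (pc, k) in results.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical values for illustration only.
HYPOTHETICAL_RESULTS = {
    "model_a": (85.0, 0.00),
    "model_b": (80.0, 0.30),
    "model_c": (60.0, 0.45),
}
```
]]></preformat>
      <p>With these hypothetical values, model_a has the highest percent correct but a combined score of zero, so it is ranked last. This mirrors the way the product penalises a model whose accuracy is no better than chance agreement.</p>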
    </sec>
    <sec id="sec-12">
      <title>4. RESULTS AND DISCUSSION</title>
      <sec id="sec-12-1">
        <title>Research Question 1: Which feature set type maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</title>
        <p>
          Across all three datasets, the feature set that combined all
feature types (F1+F2+F3+F4) performed best, followed by
(F2+F4), (F2+F3+F4) and (F1+F2+F3), while F1
performed worst. Our result shows that inclusion of features
from all four types (lexical, syntactic, idiosyncrasies and
content specific) produced the most effective model. The
result is consistent with those of [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], who
reported that combining feature types in their studies gave a better
result. Using vocabulary richness alone produced the poorest result,
probably because of the short length of the online messages in the
study.
        </p>
      </sec>
      <sec id="sec-12-2">
        <title>Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</title>
        <p>Figure 2 shows the relative performances of the four classifiers
across all feature types (F1+F2+F3+F4) and datasets.</p>
        <p>
          Neural Network (Multilayer Perceptron) performed best when
compared to the other three classifiers. Its performance was
highest on the feature set (F1+F2+F3+F4)
contained in our two-class dataset (Dataset 3). Most
previous studies considered SVM most appropriate in authorship
attribution (though often without carrying out a prior
experiment). [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], however, reported that there were no significant
performance differences between SVM and neural networks. It
could be observed that the SVM implementation (SMO) outperformed
the other three classifiers when the texts carried their natural
class labels (Dataset 1) and performed worst on Dataset 3.
This corroborates the submission of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] that no single machine
learning scheme is appropriate to all data mining problems because
real datasets vary and, to obtain accurate models, the bias of the
learning algorithm must match the structure of the domain.
This means that the structure of our Dataset 3 is more amenable to
the neural network than to any of the other machine learning schemes
(Naive Bayes, SMO, J48) in our study. Also worthy of note is the
usefulness of our application of the dimensional analysis principle,
which informed the multiplication of the two performance
measures in our study. For example, if our comparison had been
based on percent correct (in Appendix 1) only, we might have
erroneously rated the performance of SMO relatively high on
Dataset 3.
        </p>
      </sec>
      <sec id="sec-12-3">
        <title>Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?</title>
        <p>
          Fig. 3 shows the percent correct values derived for each of the
datasets in our experiment using the most precise classification
scheme (Neural Network) and all feature sets (F1+F2+F3+F4)
only. The results are presented successively for Naive Bayes,
SMO, J48 and Neural Network.
The figure shows that the dataset with two class options (Dataset
3) performed best, followed by the one with three class options
(Dataset 2) and lastly the one with the instances labelled
naturally, having five classes (Dataset 1). The result is consistent
with those of [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], who reported that authorship attribution
success improves as the number of authors or author
classes is reduced. Specifically, however, the present result shows that if we
can reduce an authorship profiling problem to a two-class one, we
can obtain an appreciable improvement in the effectiveness of the
authorship profiling task.
        </p>
      </sec>
      <sec id="sec-12-4">
        <title>Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?</title>
        <p>The TrainTestSplitMaker component of WEKA's Knowledge
Flow interface was used to evaluate the performance of our model in
classifying electronic messages to writers' countries of affiliation.
A separate two-class label file was created for each country, resulting
in a dataset for each country in which all attributes except the class
attribute were the same. The class attribute for a particular country
had instances labelled either with the country name (such as Nigeria,
Ghana, Cameroon) or with the corresponding 'non-country' name (such as Non-Nigeria,
Non-Ghana, Non-Cameroon). Table 4 shows the effectiveness of
profiling authors' countries of affiliation by the resultant model.</p>
        <p>Application of our optimization method resulted in a remarkable
improvement in distinguishing each country from the others. The
study showed that we could achieve a percent correct ranging
between 70.8% and 88.2% at Kappa statistics ranging between
0.04 and 0.34, compared to the highest possible percent correct
value of 43.8% at a Kappa statistic of 0.26 if our method was not
applied. This, however, is a trade-off on the efficiency of the
profiling process because we needed to create separate labels for
the class attribute. The extent of improvement in model
performance can, however, be said to outweigh the additional effort.
The detailed performance of the model is shown in Table 5.</p>
        <p>The resultant model performed well when we consider the
weighted averages of the performance measures of each dataset. It
could, however, be observed that the model was better at
identifying texts that were not from the country than those
that were from the country in each case. It could also be observed
that the performance of the model in predicting each country's
texts varies directly with the number of that country's texts in the
study corpus. The best performance was achieved in distinguishing
Nigerian electronic texts from Non-Nigeria texts, followed by that
of Sierra Leone and then Ghana. Thus, it could be deduced that the
performance of our model could be much improved with bigger
sub-corpora sizes.</p>
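        <p>The per-country two-class files and the train/test evaluation can be sketched as follows. This is an illustration under stated assumptions rather than the WEKA Knowledge Flow configuration itself: the function names are hypothetical, and the training fraction and random seed are assumed values, since the study does not report the split that was used.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def one_vs_rest(labels, target):
    """Relabel instances as the target country or 'Non-' plus the target."""
    negative = "Non-" + target
    return [target if label == target else negative for label in labels]


def train_test_split(rows, train_fraction=0.66, seed=1):
    """Shuffle the rows and split them into training and test portions.

    The fraction and the seed are assumptions made for this sketch."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```
]]></preformat>
        <p>Calling one_vs_rest once per country yields one two-class dataset for each country, which is the additional labelling effort referred to above as the trade-off on efficiency.</p>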
      </sec>
    </sec>
    <sec id="sec-13">
      <title>5. CONCLUSION</title>
      <p>Through experiments, the study sought the number of class options,
feature set types and machine learning scheme that maximize the
effectiveness of identifying the countries of affiliation of authors of
online messages written in English. The messages in our corpus were
collected from online forums of five African countries and had
average lengths of 52 to 102 words. Using the product of percent
correct and the kappa statistic as our basis for model justification,
the experiment showed that the most effective model was achieved when
all feature set types, contained in a two-class dataset, were analysed
with the neural network (multilayer perceptron) machine learning
scheme. Applying the parameters of this most effective model to
profiling the countries of affiliation of the authors of the online
messages resulted in an improvement in effectiveness of about one
hundred percent.</p>
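      <p>The selection criterion named above, the product of percent correct and the kappa statistic, can be sketched as follows. The configuration names and the (percent correct, kappa) pairs are hypothetical and are not results from the experiment; the function names are likewise assumptions made for illustration.</p>
      <preformat>
```python
def effectiveness_score(percent_correct, kappa):
    """Product of percent correct (0 to 100) and the kappa statistic."""
    return percent_correct * kappa


def most_effective(results):
    """Return the name of the configuration with the highest product score."""
    return max(results, key=lambda name: effectiveness_score(*results[name]))


# Hypothetical (percent correct, kappa) pairs, one per candidate configuration.
results = {
    "config_a": (80.0, 0.10),  # accurate, but barely better than chance
    "config_b": (70.0, 0.30),
    "config_c": (60.0, 0.25),
}

best = most_effective(results)
print(best, effectiveness_score(*results[best]))
```
      </preformat>
      <p>Under this criterion a configuration with a high percent correct but a low kappa is ranked below one whose agreement owes less to chance.</p>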
      <p>The study achieved greater effectiveness, but at a cost in
efficiency. A model that maximizes both effectiveness and efficiency
in profiling the authorship of online messages remains a subject for
further study. In its present state, the approach is most appropriate
when a particular group is already suspected and the purpose of
authorship attribution is to confirm the suspect's group of
affiliation.</p>
    </sec>
    <sec id="sec-14">
      <title>Appendix 1: Experiment Result</title>
      <p>[Experiment result table not recovered from the source; surviving labels: Dataset 1, SMO, PC.]</p>
    </sec>
  </body>
  <back>
  </back>
</article>