=Paper= {{Paper |id=Vol-1755/96-103 |storemode=property |title=Optimizing Authorship Profiling of Online Messages |pdfUrl=https://ceur-ws.org/Vol-1755/96-103.pdf |volume=Vol-1755 |authors=Adeola Opesade |dblpUrl=https://dblp.org/rec/conf/cori/Opesade16 }} ==Optimizing Authorship Profiling of Online Messages== https://ceur-ws.org/Vol-1755/96-103.pdf
        Optimizing Authorship Profiling of Online Messages
                                                         Adeola O. Opesade
                                            Africa Regional Centre for Information Science,
                                                     University of Ibadan, Nigeria
                                                       morecrown@gmail.com

ABSTRACT
Authorship profiling is of growing importance in the current information age, partly due to its application in digital forensics. Methodologies of profiling, like any other authorship analysis, consist mainly of feature extraction and the application of analytical techniques. The choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis; hence the need for methods that can improve the success of authorship profiling undertakings. The present study sought, through experiments, the writing features, analytical technique and number of class labels that can improve the effectiveness of profiling the country of affiliation of authors of online messages. The experiments showed that the most effective model was achieved when all feature set types in our study were used within a two-class dataset analysed with the Neural Network (Multilayer Perceptron) machine learning scheme. The study recommends further studies to find models that can maximize both the effectiveness and the efficiency of profiling the authorship of online messages.

CCS Concepts
• General and reference ➝ Cross-computing tools and techniques ➝ Experimentation

Keywords
Authorship profiling, Machine learning, Computational linguistics, Natural Language Processing, Nigerian English

CoRI’16, Sept 7–9, 2016, Ibadan, Nigeria.

1. INTRODUCTION
Electronic messages are extensively used to distribute information over such channels as e-mail, Internet newsgroups, Internet chat rooms, Internet forums and other user-generated content on the Web. These messages are quite different from other forms of writing, particularly because of their brevity. Unfortunately, criminals exploit the convenience of these media to carry out their obnoxious goals. Digital forensics requires the use of scientifically derived and proven methods for the preservation, collection, validation, identification, analysis, interpretation, documentation and presentation of digital evidence derived from digital sources for litigation purposes.

Authorship profiling is one of the major classes of authorship attribution problems. It seeks the demographic or psychological group of the author of an anonymous text. Its application in forensics and digital security has made it of growing importance in the present information age. Methodologies of profiling, like any other authorship analysis, consist mainly of feature extraction and the application of analytical techniques. The choice of feature sets and analytical techniques may significantly affect the performance of authorship analysis [1]; thus, studies into the optimization of authorship profiling of online messages can assist in improving the success of identifying the sources of security threats perpetrated through web-based channels.

A number of previous studies ([1]; [2]; [3]) have investigated parameters that could affect the effectiveness of authorship attribution undertakings. These studies, however, focused on the authorship identification problem rather than on authorship profiling. Considering the potential of authorship profiling in investigating transnational digital breaches, the present study seeks to find, through experiments, the writing-style features, classification techniques and number of class options that can maximize the effectiveness of profiling the authorship of electronic messages. The following research questions were pursued in order to achieve the purpose of the study:

Research Question 1: Which feature set type maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?

2. LITERATURE REVIEW
2.1 Authorship Attribution Problems
Authorship attribution is the process of examining the characteristics of a piece of writing in order to draw conclusions about its author. Authorship attribution problems vary in complexity. They have been categorized into three major classes, namely authorship identification, authorship profiling and authorship verification. The most straightforward of the three is the identification problem, which involves determining the actual author of a given text among a small set of candidate authors. Given a set of writings of a number of authors, the task in authorship identification is to assign a new piece of writing to one of them [4]. In authorship verification, there is no closed candidate set; there is one suspect, and the challenge is to determine whether the suspect is or is not the author. In this case, examples of the writing of a single
author are given and the task is to verify whether a given target text was or was not written by this author. Hence, verification can be thought of as a one-class classification problem, and it is significantly more difficult than the basic authorship identification problem [5].

In authorship profiling (also known as the authorship characterization problem) there is no candidate set at all; the challenge is to provide as much demographic or psychological information as possible about the author. Unlike the identification problem, authorship profiling does not begin with a set of writing samples from known candidate authors. Instead, it exploits the sociolinguistic observation that different groups of people speaking or writing in a particular genre and in a particular language use that language differently; that is, they vary in how often they use certain words or syntactic constructions, in addition to variation in pronunciation or intonation [6]. The profiling problem is concerned with determining such characteristics as the gender, educational and cultural background, and language familiarity of the author who produced a piece of work. This is a harder problem than the identification problem since it characterizes the writing style of a set of writers rather than the unique style of a single person [7].

Despite variations in the complexities of authorship problems, the choice of appropriate linguistic features and analytical techniques is paramount.

2.2 Authorship Attribution Methods
One of the main components of authorship attribution methods is the extraction of linguistic features that represent the writing style of an author or author group. Language, like genetics, can be characterized by a very large set of potential features that may or may not show up in any specific sample, and that may or may not have obvious large-scale impact. By identifying the features characteristic of a group or individual of interest, and then finding those features in an anonymous document, one can support a finding that the document was written by that person or by a member of that group [8]. The various feature sets, otherwise known as feature metrics in computational linguistics, can be classified into four main classes: lexical, syntactic, content-specific and structural features [9]. Researchers vary in their choice of linguistic features; some used feature(s) that belong to a single class (for example, [10]; [11]; [12]; and [9]), while others (such as [6]; [2]; [4]; [3]; [7]; [1]; [13]; [14]) used features across multiple feature classes.

The second component is the application of analytical techniques to feature sets for supervised or unsupervised learning. Different analytical techniques have been used in previous authorship attribution studies. These techniques can be classified into three approaches, namely the unitary invariant, multivariate and machine learning approaches [8]. Machine learning examines previous examples and their outcomes and learns how to reproduce these and make generalisations about new cases. Machine learning algorithms differ in terms of level of data and their ability to resolve data ambiguities such as noise or missing data. Machine learning techniques include rule-based algorithms such as OneR, neural networks such as the Multilayer Perceptron, statistical modelling algorithms such as Naive Bayes, decision trees such as J48, linear models such as linear regression and the Support Vector Machine, and instance-based learning algorithms such as Nearest Neighbour.

Unlike in the choice of feature sets, researchers are less varied in their choice of analytical techniques. While older studies tended to favour the use of Principal Component Analysis, more recent ones tend towards the use of the Support Vector Machine. Most previous studies reported the use of only a single analytical technique. Consider the statement made by [15]:

        Experience shows that no single machine learning scheme is
        appropriate to all data mining problems. The universal learner is
        an idealistic fantasy. Real datasets vary and to obtain accurate
        models, the bias of the learning algorithm must match the
        structure of the domain. Data mining is an experimental science
        (pg 365).

The choice of machine learning scheme should therefore be based on the result of a prior experiment that validates its suitability to the dataset.

2.3 Related Authorship Studies
A number of previous studies have shown the relative performances of a number of feature types and analytical techniques in authorship analyses. [3] studied the results of authorship identification using many authors and limited training data. Their results showed that systematically increasing the number of authors under investigation led to a significant decrease in performance. Their study also revealed that providing a more heterogeneous set of features improves the system significantly. [1] investigated the types of writing-style features and classification techniques that were effective for identifying the authorship of online messages. They reported that accuracy kept increasing as more types of features were used, and that the Support Vector Machine (SVM) outperformed Neural Networks (NN), which in turn outperformed the C4.5 classifier. The best accuracy was achieved when SVM and all feature types were used, but classifier performance reduced as the number of authors increased. [2] demonstrated through experiments that adding stylistic idiosyncrasy features to letter n-grams, to function words, and to a combination of n-grams and function words consistently improved accuracy in identifying the native language of the author of a given English-language text.

The studies of [3] and [1] are situated within the identification domain of authorship attribution problems because they started with a closed set of candidate authors, while that of [2] was a profiling problem. However, the focus of [2] was mainly to show the ability of idiosyncrasies to detect a writer's native language. It therefore did not address some of the salient issues covered by [1], namely the relative performances of analytical techniques and the effect of increasing the number of candidate authors. Also, the corpus used by [2] was the International Corpus of Learner English (ICLE), whose texts had between 579 and 846 words. These numbers are quite high for online messages, which are usually very short. The present study focuses on the shorter texts that characterise online messages. Therefore, the present study seeks to find the writing-style (linguistic) features, classification techniques and number of class options that can maximize the effectiveness of profiling the native language of the author of an online message.
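The style markers surveyed above (lexical richness, function-word frequencies and similar class-based features) can be illustrated with a minimal, self-contained sketch. It computes one lexical feature, the type-token ratio as a vocabulary-richness measure, and simple syntactic features, the relative frequencies of a few function words. The function-word list and the naive tokenizer are illustrative assumptions, not the exact ones used in any of the cited studies.

```python
import re
from collections import Counter

# Illustrative function-word list; studies typically derive the most
# frequent function words from their own corpus.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that"]

def extract_features(text):
    """Return a small style-marker vector for one online message."""
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenizer
    n = len(tokens)
    counts = Counter(tokens)
    features = {
        # Lexical: vocabulary richness as the type-token ratio.
        "type_token_ratio": len(counts) / n if n else 0.0,
    }
    # Syntactic: relative frequency of each function word.
    for w in FUNCTION_WORDS:
        features["fw_" + w] = counts[w] / n if n else 0.0
    return features

msg = "Our beloved country is the best country in the world."
feats = extract_features(msg)
print(feats["type_token_ratio"])  # → 0.8
```

A real pipeline would add the remaining feature classes (spelling errors, POS bigrams, content unigrams), but the shape of the output, one numeric vector per message, is the same.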
3. EXPERIMENTATION FOR OPTIMIZING AUTHORSHIP PROFILING OF ONLINE MESSAGES
3.1 Problem Formulation
A number of online messages are given, written in the English language by nationals of selected African countries, namely Cameroon, Ghana, Liberia, Nigeria and Sierra Leone. The goal is to find the types of writing-style features, the classification technique and the number of class options that maximize the effectiveness of profiling the linguistic origin of anonymous electronic texts written by nationals of any of the selected countries.

3.2 Research Method
A multistage sampling technique was used to select a representative sample of electronic texts from the population of texts contained in the relevant country pages of the website www.topix.com. To obtain texts that could be useful for the supervised learning approach of the study, each text was opened, read and assessed based on the number of words it contained and the sense of affiliation to the respective country depicted in its content. A comment was considered to be affiliated to (and labelled as being from) a particular country if it was found in that country's forum and if it contained such phrases as 'our country', 'our beloved country' and other related ones in its discourse. Initially the researcher targeted texts with a hundred or more words; however, this was reduced to texts with twenty (20) or more words because of the scarcity of long texts on the discussion forums. The numbers of texts selected for the study in November 2011, based on the assessment criteria, are shown in Table 1.

Table 1: Training Data Set

  Country's forum website                  No. of pages   Pages selected   No. of selected texts
  www.topix.com/forum/world/nigeria        31             2, 8, 13, 25     425
  www.topix.com/forum/world/ghana          9              2, 3, 6, 9       317
  www.topix.com/forum/world/liberia        4              1-4              130
  www.topix.com/forum/world/cameroon       4              1-4              241
  www.topix.com/forum/world/sierra-leone   4              1-4              357
  Total no. of texts                                                       1,470

3.2.1 Text Pre-processing and Processing
The corpora were subjected to pre-processing in order to put them in the format expected by the relevant software for text processing. The pre-processing tasks included deletion of e-mail headers, removal of control codes, text aggregation, and removal of non-ASCII characters. Text processing was achieved by extracting linguistic features from the sampled texts using computer code written by the researcher in the Python 2.6.4 programming language, based on the Natural Language Toolkit (NLTK) version 2.0. Some of the specific issues handled in the course of text processing were tokenization, part-of-speech tagging and linguistic feature extraction.

Although there is no agreement on a best set of features for a wide range of application domains, selected feature metrics must be reliable characteristics of the attribution domain [21]. Features were extracted in the present study based on their relevance as determined from the relevant literature on authorship attribution and Nigerian Englishes ([16]; [17]). The extracted features were: syntactic features, comprising the twenty (20) most frequent function words in the topix.com corpus; idiosyncratic features, comprising the frequency of occurrence of spelling errors, the adverb-verb part-of-speech (POS) bigram distribution and the article omission/inclusion distribution; structural features, comprising lexical diversity; and content-specific features, consisting of the twenty (20) most frequent noun, adjective, verb and adverb unigrams in the topix.com corpus. The features extracted and their denotations are shown in Table 2.

Table 2: Extracted Linguistic Features

  Feature type       Feature metric                                      Denotation
  Lexical            Vocabulary richness                                 F1
  Syntactic          Probabilities of occurrence of the most             F2
                     frequent function words
  Idiosyncrasies     Probabilities of occurrence of article deletion,    F3
                     verb-adverb sequence and spelling errors
  Content-specific   Noun unigrams, adjective unigrams, verb             F4
                     unigrams, adverb unigrams

The decision to extract the twenty most frequent features (function word, noun, adjective, verb and adverb unigrams) was the result of a prior experiment which showed that the summation of the frequencies of occurrence of the twenty most frequent features accounted for at least 60% of the cumulative frequency of all features extracted in each case.

3.3 Experimental Setup
i. Class Labelling: According to the study of [3], the learner's performance changes with the number of candidate authors. To find out the effect of varying the number of classes on classification performance in the present study, the dataset was copied into three different files with all parameters the same except the class labels. The class labels were controlled as presented in Table 3.

Table 3: Dataset Class Labelling Options

  File name   No. of class   Class labels                Remark
              labels
  Dataset 1   5              Nigeria, Ghana, Cameroon,   Labelling according to the texts'
                             Liberia, Sierra-Leone       original classes.
  Dataset 2   3              Nigeria, Ghana,             Labelling informed by language
                             Non-Ghana-Nigeria           similarities between the selected
                                                         countries found in a previous
                                                         study [21].
  Dataset 3   2              Nigeria, Non-Nigeria        Testing a 2-class labelling scheme
                                                         which can enable the identification
                                                         of online texts from one country
                                                         against those of all other countries
                                                         put together.
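The relabelling of Table 3 amounts to a simple mapping over the original five country labels. The helper functions below are an illustrative sketch of that mapping, not the exact code used in the study.

```python
# Collapse the five original country labels into the three- and
# two-class schemes of Table 3 (Dataset 2 and Dataset 3).
def relabel_dataset2(label):
    # Nigeria and Ghana keep their labels; the other three are merged.
    return label if label in ("Nigeria", "Ghana") else "Non-Ghana-Nigeria"

def relabel_dataset3(label):
    # Only Nigeria keeps its label; all other countries are merged.
    return label if label == "Nigeria" else "Non-Nigeria"

labels = ["Nigeria", "Ghana", "Cameroon", "Liberia", "Sierra-Leone"]
print([relabel_dataset2(l) for l in labels])
print([relabel_dataset3(l) for l in labels])
```

Because only the class attribute changes, the three files share identical feature columns, which is what makes the later comparison across labelling options fair.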
The texts in Dataset 1 bear their original class labels, that is, the actual countries of affiliation of the writers as determined from the forums and the texts. There are therefore five different class labels, representing the five country sources of the texts. Dataset 2 has three class labels: texts from Nigeria and Ghana bear their original country source labels, while those from the other three countries were combined and labelled 'Non-Ghana-Nigeria'. This was informed by a previous study that showed varying degrees of similarity in English language usage among the selected countries. Dataset 3 labelled texts from Nigeria as 'Nigeria' while texts from the other four countries were combined under the label 'Non-Nigeria'. This was done to achieve a two-class dataset option.

Experiments were carried out using the Experimenter interface of the open-source Waikato Environment for Knowledge Analysis (WEKA) machine learning tool. In this study, four machine learning algorithm implementations in WEKA were used, namely Naive Bayes, SMO (an SVM implementation), J48 and Multilayer Perceptron (a neural network implementation). The experiment was carried out to compare the performances of classifier models in the face of:
               a. Changing the number of classes.
               b. Changing the linguistic feature sets.
               c. Changing the classifier algorithms.

Each of the three datasets (Dataset 1, Dataset 2 and Dataset 3) with each of the feature set types (F1, F2, F3, F4) and all their possible combinations (F1+F2, F1+F2+F3, F1+F2+F3+F4, F1+F2+F4, F1+F3, F1+F4, F2+F3, F2+F3+F4, F2+F4, F3+F4, F3+F4+F1) was analysed using the four machine learning algorithms.

Ten-fold cross-validation was used to evaluate the models' performances based on percent correct (the percentage of all instances that are classified correctly) and the Kappa statistic (a measure of the agreement between predicted and observed categorization, corrected for agreement that happens by chance).

3.4 Evaluation of the Experiments
The tables in Appendix 1 show the percent correct and Kappa statistic values derived for each of the datasets in our experiment. The results are presented successively for Naive Bayes, SMO, J48 and Multilayer Perceptron. It can be observed from the tables that the percent correct values are generally highest for Dataset 3, while the Kappa statistics are generally highest for Dataset 2. This observation cuts across virtually all feature sets and classifiers. It implies that classifiers were better able to classify Dataset 3 correctly than the other datasets, while the classifications achieved on Dataset 2 gave better agreement between predicted and observed categorization after correcting for agreement that happened by chance. Worthy of note is the result of SMO on Dataset 3: although the percent correct values were relatively high, the Kappa statistics were all zero. This lack of coherence in the directions of the two performance measures led us to use the product of the two measures (percent correct and Kappa statistic) as the basis for comparing the models' performances.

This decision to use the product was informed by the theory of dimensional analysis, a problem-solving method that uses the fact that any number or expression can be multiplied by one without changing its value. One can only meaningfully add or subtract quantities of the same type, but one can multiply or divide quantities of different types; when two measurements are multiplied together, the type of the product depends on the types of the measurements. This analysis is routinely applied in physics, and it is an engineering tool widely applied to numerous engineering problems for designing and testing all types of engineering and physical systems ([18]; [19]). The results of the products of the two measures are presented in Appendix 2. The table in Appendix 2 presents the performances of our models taking both performance measures into consideration. We consider this table more representative of the models' performances because it combines the strengths and weaknesses of the two performance measures. Answers to the research questions will, therefore, be based on the content of this table.

4. RESULTS AND DISCUSSION
Research Question 1: Which feature set type maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Figure 1 is a derivative of the table in Appendix 2; it shows the product of the percent correct and Kappa statistic values derived for the feature set types in our experiment. The results are presented successively for Naive Bayes, SMO, J48 and Neural Network.

        Figure 1: Comparison of feature set performances

Across all three datasets, the feature set that combined all feature types (F1+F2+F3+F4) performed best. This was followed by (F2+F4), (F2+F3+F4) and (F1+F2+F3), while the performance of F1 was the poorest. Our result shows that the inclusion of features from all four types (lexical, syntactic, idiosyncrasies and content-specific) produced the most effective model. This result is consistent with those of [20], [2] and [1], who reported that combining feature types gave better results in their studies. Using vocabulary richness alone produced the poorest result, probably because of the short length of the online messages in the study.

Research Question 2: Which classification scheme maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Figure 2 shows the relative performances of the four classifiers across all feature types (F1+F2+F3+F4) and datasets.

        Figure 2: Relative performances of the four classifiers across all feature and data sets.
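The product-based criterion of Section 3.4 can be reproduced with a short sketch. It computes percent correct and the Kappa statistic from a confusion matrix and multiplies them; the matrix below is a made-up example, not data from the study.

```python
def percent_correct(cm):
    """Percentage of instances on the diagonal of confusion matrix cm."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return 100.0 * correct / total

def kappa(cm):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    total = sum(sum(row) for row in cm)
    po = sum(cm[i][i] for i in range(len(cm))) / total  # observed agreement
    pe = sum(                                           # chance agreement
        (sum(cm[i]) / total) * (sum(row[i] for row in cm) / total)
        for i in range(len(cm))
    )
    return (po - pe) / (1 - pe)

# Hypothetical 2-class confusion matrix (rows = actual, cols = predicted).
cm = [[70, 10],
      [20, 50]]
score = percent_correct(cm) * kappa(cm)  # the PC*KS ranking criterion
print(round(percent_correct(cm), 1), round(kappa(cm), 3))  # → 80.0 0.595
```

Note that a degenerate classifier that assigns every text to the majority class can still have a high percent correct while its Kappa is exactly zero (observed agreement equals chance agreement), which is the SMO-on-Dataset-3 situation the product criterion is designed to penalize.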
Neural Network (multilayer perceptron) performed best when compared with the other three classifiers. Its performance was highest on the full feature set (F1+F2+F3+F4) contained in our two-class dataset (Dataset 3). Most previous studies considered SVM the most appropriate technique for authorship attribution (though often without a prior comparative experiment); [1], however, reported no significant performance differences between SVM and neural networks. It could be observed that the SVM implementation (SMO) outperformed the other three classifiers when the texts carried their natural class labels (Dataset 1) but performed worst on Dataset 3. This corroborates the submission of [15] that no single machine learning scheme suits every data mining problem: real datasets vary, and to obtain accurate models the bias of the learning algorithm must match the structure of the domain. In other words, the structure of our Dataset 3 is more amenable to a neural network than to any of the other machine learning schemes (Naive Bayes, SMO, J48) in our study. Also worthy of note is the usefulness of the dimensional analysis principle, which informed our multiplication of the two performance measures. For example, had our comparison been based on percent correct alone (Appendix 1), we might have erroneously rated the performance of SMO on Dataset 3 as relatively high.

Research Question 3: Which class labelling option maximizes the effectiveness of profiling the country of affiliation of writers of online messages?

Fig. 3 shows the percent correct values derived for each of the datasets in our experiment using the most precise classification scheme (Neural Network) and all feature sets (F1+F2+F3+F4) only. The results are presented successively for Naive Bayes, SMO, J48 and Neural Network.

Figure 3: Column Chart of Classifier Performances with Varied Class Labelling Options

The figure shows that the dataset with two class options (Dataset 3) performed best, followed by the one with three class options (Dataset 2) and, lastly, the one with its instances labelled naturally into five classes (Dataset 1). The result is consistent with those of [3] and [1], which reported that authorship attribution success improves as the number of authors or author classes is reduced. More specifically, the present result shows that if an authorship profiling problem can be reduced to a two-class one, an appreciable improvement in the effectiveness of the profiling task can be obtained.

Research Question 4: What is the performance of the resultant model in classifying electronic messages to writers' countries of affiliation?

We used the TrainTestSplitMaker component of WEKA's knowledge flow interface to evaluate the performance of our model in classifying electronic messages to writers' countries of affiliation. A separate two-class label file was created for each country, resulting in a dataset per country in which all attributes except the class attribute were the same. The class attribute for a particular country had instances labelled either with the country name (e.g. Nigeria, Ghana, Cameroon) or with the corresponding non-country name (e.g. Non-Nigeria, Non-Ghana, Non-Cameroon). Table 4 shows the effectiveness of profiling authors' countries of affiliation with the resultant model.

Table 4: Effectiveness of Profiling Authors' Countries of Affiliation

Country         Percent Correct   Kappa Statistic   PC*KS
Nigeria         75.80             0.34              25.95
Cameroon        73.80             0.10              7.68
Ghana           78.40             0.27              21.54
Liberia         88.20             0.04              3.23
Sierra Leone    70.80             0.28              19.59
PC*KS denotes Percent Correct * Kappa Statistic

Application of our optimization method resulted in a remarkable improvement in the profiling of each country against the others. The study showed that we could achieve percent correct values ranging between 70.8% and 88.2% at kappa statistics ranging between 0.04 and 0.34, compared with the highest possible percent correct value of 43.8% at a kappa statistic of 0.26 had our method not been applied. This, however, is a trade-off against the efficiency of the profiling process, because we needed to create separate labels for the class attribute. The extent of the improvement in model performance, however, can be said to outweigh the additional effort. The detailed performance of the model is shown in Table 5.

Table 5: Detailed Prediction Performance of the Resultant Model

Class              TP Rate   FP Rate   Precision   Recall   F-score   ROC Area
Nigerian           0.380     0.080     0.671       0.380    0.485     0.721
Non-Nigerian       0.920     0.620     0.776       0.920    0.842     0.721
Weighted Average   0.758     0.458     0.744       0.758    0.735     0.721

Cameroon           0.299     0.182     0.230       0.299    0.260     0.652
Non-Cameroon       0.818     0.701     0.865       0.818    0.841     0.652
Weighted Average   0.738     0.621     0.767       0.738    0.751     0.652

Ghanaian           0.333     0.092     0.500       0.333    0.400     0.671
Non-Ghanaian       0.908     0.667     0.832       0.908    0.868     0.671
Weighted Average   0.784     0.543     0.760       0.784    0.767     0.671

Liberian           0.036     0.013     0.250       0.036    0.063     0.671
Non-Liberian       0.987     0.964     0.892       0.987    0.937     0.671
Weighted Average   0.882     0.859     0.822       0.822    0.841     0.671

Sierra-Leonean (SL) 0.582    0.256     0.390       0.582    0.467     0.748
Non-SL             0.744     0.418     0.863       0.744    0.799     0.748
Weighted Average   0.708     0.383     0.759       0.708    0.726     0.748
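The per-country relabeling described above, together with the two measures reported in Tables 4 and 5, can be sketched in plain Python. WEKA was used in the study itself; this fragment is only a minimal illustration, and the confusion-matrix counts are hypothetical values chosen for the example, not the study's data.

```python
# One-vs-rest relabeling and evaluation measures, sketched in plain
# Python. The counts below are hypothetical illustration values.

def relabel(country_labels, target):
    """Map multi-country labels to a two-class scheme: the target
    country versus 'Non-<target>'."""
    return [c if c == target else f"Non-{target}" for c in country_labels]

def percent_correct_and_kappa(tp, fn, fp, tn):
    """Percent correct and Cohen's kappa from binary confusion counts."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n                      # observed agreement
    p_e = ((tp + fn) * (tp + fp) +           # agreement expected by chance
           (fp + tn) * (fn + tn)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return 100 * p_o, kappa

labels = ["Nigeria", "Ghana", "Cameroon", "Liberia", "Sierra Leone"]
print(relabel(labels, "Nigeria"))
# ['Nigeria', 'Non-Nigeria', 'Non-Nigeria', 'Non-Nigeria', 'Non-Nigeria']

# Hypothetical counts: 100 target texts (38 found) and 100 non-target
# texts (92 found).
pc, ks = percent_correct_and_kappa(tp=38, fn=62, fp=8, tn=92)
print(round(pc, 1), round(ks, 2))  # 65.0 0.3
```

The kappa term discounts the agreement a classifier would achieve by chance on an imbalanced two-class dataset, which is why it is reported alongside percent correct throughout the tables above.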
The resultant model performed well when we consider the weighted averages of the performance measures for each dataset. It could, however, be observed that in each case the model was better at identifying texts that were not from the country than those that were. It could also be observed that the model's performance in predicting each country's texts varies directly with the number of that country's texts in the study corpus. The best performance was achieved in profiling Nigerian electronic texts against Non-Nigerian texts, followed by those of Sierra Leone and then Ghana. Thus, it could be deduced that the performance of our model could be much improved with bigger sub-corpora.

5. CONCLUSION
The study sought, through experiments, the number of class options, feature set types and machine learning scheme that maximize the effectiveness of identifying the countries of affiliation of authors of online messages composed in the English language. The online messages in our corpus were collected from online forums of five African countries, with average lengths of 52 to 102 words. Using the product of percent correct and kappa statistics as our basis for model justification, the experiments showed that we achieved the most effective model when all feature set types, contained in a two-class dataset, were analysed with the neural network (multilayer perceptron) machine learning scheme. Application of the parameters of the most effective model (derived from the experiment) to profiling the countries of affiliation of the authors of the online messages resulted in about a hundred percent improvement in effectiveness.

The study achieved greater effectiveness but with a trade-off in efficiency. We look forward to a model that can maximize both effectiveness and efficiency in profiling the authorship of online messages; this constitutes a need for further studies. In its present state, the approach is most appropriate when a group is suspected and the purpose of authorship attribution is to affirm one's view of the suspect's group of affiliation.

6. REFERENCES
[1] Zheng, R., Li, J., Chen, H. and Huang, Z. 2006. A framework for authorship identification of online messages: writing-style features and classification techniques. Journal of the American Society for Information Science and Technology 57(3). 378-393.
[2] Koppel, M., Schler, J. and Zigdon, K. 2005. Automatically determining an anonymous author's native language. Lecture Notes in Computer Science (LNCS) 3495. Eds. Kantor, P.B., Muresan, G., Roberts, F., Zeng, D.D. and Wang, F.: ISI 2005. Berlin: Springer-Verlag. 209-217.
[3] Luyckx, K. and Daelemans, W. 2008. Authorship attribution and verification with many authors and limited data. In: Proceedings of the 22nd International Conference on Computational Linguistics, Manchester, 18-22 August 2008. 513-520.
[4] Koppel, M., Schler, J., Argamon, S. and Messeri, E. 2006. Authorship attribution with thousands of candidate authors. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Aug. 6-11 2006, Seattle, Washington, USA.
[5] Koppel, M., Schler, J. and Argamon, S. 2009. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology 60(1). 9-26.
[6] Argamon, S., Koppel, M., Pennebaker, J.W. and Schler, J. 2009. Automatically profiling the author of an anonymous text. Communications of the ACM 52(2). 119-123.
[7] De Vel, O., Anderson, A., Corney, M. and Mohay, G. 2001. Mining e-mail content for author identification forensics. ACM SIGMOD Record 30(4). 55-64.
[8] Juola, P. 2007. Future trends in authorship attribution. International Federation for Information Processing 24(2). 119-132.
[9] Iqbal, F., Hadjidj, R., Fung, B.C.M. and Debbabi, M. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. 2008 Digital Forensic Research Workshop. Elsevier Ltd. Retrieved Nov. 16, 2009, from www.elsevier.com/locate/diin.2008.05.001
[10] Holmes, D.I. 2003. Stylometry and the Civil War: the case of the Pickett letters. CHANCE 16(2). 18-25.
[11] Binongo, J.N.G. 2003. Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. CHANCE 16(2). 9-17.
[12] Binongo, J.N.G. and Smith, M.W.A. 1999. The application of principal component analysis to stylometry. Literary and Linguistic Computing 14(4). 445-466.
[13] Abbasi, A. and Chen, H. 2006. Visualizing authorship for identification. Lecture Notes in Computer Science (LNCS) 3975. Eds. Mehrotra, S., Zeng, D.D., Chen, H., Thuraisingham, B. and Wang, F. Berlin: Springer-Verlag. 60-71.
[14] Abbasi, A. and Chen, H. 2008. Writeprints: a stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems 26(2). doi: 10.1145/1344411.1344413.
[15] Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed. USA: Morgan Kaufmann Publishers.
[16] Kujore, O. 1985. English Usage: Some Notable Nigerian Variations. Nigeria: Evans Brothers Nigeria Publishers Limited. 1-112.
[17] Jowitt, D. 1991. Nigerian English Usage: An Introduction. Nigeria: Longman. 1-277.
[18] Balaguer, P. 2013. Application of Dimensional Analysis in Systems Modeling and Control Design. The Institution of Engineering and Technology.
[19] Szirtes, T. 2007. Applied Dimensional Analysis and Modeling. Amsterdam; New York: Elsevier/Butterworth-Heinemann.
[20] Ma, J., Teng, G., Zhang, Y., Li, Y. and Li, Y. 2009. A cybercrime forensic method for Chinese web information authorship analysis. In: PAISI 2009, LNCS 5477. Eds. H. Chen et al. Berlin Heidelberg: Springer-Verlag. 14-24.
[21] Opesade, A., Adegbola, T. and Tiamiyu, M. 2013. Comparative analysis of idiosyncrasy, content and function word distributions in the English language variants of selected African countries. International Journal of Computational Linguistics Research 4(3). 130-143.
Appendix 1: Experiment Result



                        Naive Bayes                                      SMO
Feature        Dataset 1    Dataset 2    Dataset 3    Dataset 1    Dataset 2    Dataset 3
               PC    KS     PC    KS     PC    KS     PC    KS     PC    KS     PC    KS
F1+F2+F3+F4    34.96 0.18   58.49 0.33   67.34 0.31   43.65 0.25   62.12 0.34   71.09 0.00
F1             31.87 0.06   50.14 0.09   71.09 0.00   31.52 0.05   49.52 0.00   71.09 0.00
F1+F2          36.29 0.20   60.11 0.34   69.41 0.29   42.53 0.24   58.54 0.26   71.09 0.00
F1+F2+F3       36.77 0.20   60.31 0.35   69.78 0.32   42.48 0.24   60.33 0.30   71.09 0.00
F1+F2+F4       34.80 0.18   58.39 0.33   66.97 0.30   42.84 0.24   60.54 0.30   71.09 0.00
F1+F3          32.37 0.11   52.63 0.21   69.77 0.15   32.73 0.07   49.48 0.00   71.09 0.00
F1+F4          32.10 0.15   55.01 0.28   65.59 0.27   34.07 0.10   49.86 0.06   71.09 0.00
F2             36.06 0.20   59.77 0.33   70.74 0.30   41.83 0.23   58.48 0.26   71.09 0.00
F2+F3          36.69 0.20   60.51 0.35   70.63 0.31   42.32 0.23   59.57 0.28   71.09 0.00
F2+F3+F4       34.64 0.18   58.86 0.33   67.93 0.31   43.76 0.26   61.72 0.33   71.09 0.00
F2+F4          34.65 0.18   58.73 0.33   67.46 0.30   42.50 0.24   59.89 0.29   71.09 0.00
F3             31.79 0.10   53.24 0.18   71.46 0.09   31.86 0.06   49.50 0.00   71.09 0.00
F3+F4          32.43 0.15   55.97 0.29   66.14 0.27   35.05 0.12   51.58 0.10   71.09 0.00
F3+F4+F1       32.64 0.15   55.44 0.28   65.23 0.26   34.50 0.11   52.17 0.11   71.09 0.00
F4             31.53 0.14   55.37 0.28   65.93 0.26   34.88 0.12   49.75 0.06   71.09 0.00
PC = Percent Correct    KS = Kappa Statistic

Experiment Result Continued
                        Tree (J48)                      Multilayer Perceptron (Neural Network)
Feature        Dataset 1    Dataset 2    Dataset 3    Dataset 1    Dataset 2    Dataset 3
               PC    KS     PC    KS     PC    KS     PC    KS     PC    KS     PC    KS
F1+F2+F3+F4
F1             35.11 0.13   49.92 0.15   71.09 0.00   31.76 0.07   50.01 0.13   71.09 0.00
F1+F2          38.32 0.20   55.59 0.28   72.10 0.24   40.28 0.22   60.31 0.33   72.73 0.30
F1+F2+F3       37.66 0.20   55.05 0.28   71.74 0.24   40.16 0.22   61.05 0.35   74.16 0.33
F1+F2+F4       37.88 0.20   55.65 0.29   70.80 0.24   41.22 0.23   60.32 0.34   72.91 0.32
F1+F3          31.58 0.11   51.93 0.20   72.37 0.09   32.73 0.10   52.52 0.19   70.39 0.03
F1+F4          34.61 0.15   55.34 0.28   70.43 0.05   38.69 0.18   57.62 0.29   69.86 0.16
F2             37.87 0.20   55.57 0.28   72.20 0.26   40.18 0.21   59.50 0.32   71.90 0.28
F2+F3          36.97 0.19   55.41 0.28   71.63 0.25   41.02 0.23   61.19 0.35   73.98 0.33
F2+F3+F4       37.76 0.20   56.08 0.30   71.11 0.25   41.22 0.24   60.37 0.34   73.81 0.33
F2+F4          37.84 0.20   56.18 0.30   71.14 0.25   40.60 0.22   60.30 0.34   72.50 0.30
F3             29.48 0.07   52.54 0.17   70.59 0.04   31.64 0.08   52.90 0.17   70.44 0.04
F3+F4          35.49 0.17   54.41 0.27   69.78 0.17   38.06 0.18   57.67 0.30   70.46 0.19
F3+F4+F1       34.71 0.16   54.12 0.27   69.94 0.18   38.48 0.19   57.69 0.30   70.12 0.20
F4             35.41 0.16   55.06 0.27   70.09 0.02   38.46 0.18   57.03 0.29   70.49 0.12
PC = Percent Correct    KS = Kappa Statistic




Appendix 2: Products of Percent Correct and Kappa Statistics
              Naive Bayes (PC*KS)            SMO (PC*KS)              J48 (PC*KS)         Multilayer Perceptron (PC*KS)
Feature       Dataset 1 Dataset 2 Dataset 3  Dataset 1 Dataset 2 Dataset 3  Dataset 1 Dataset 2 Dataset 3  Dataset 1 Dataset 2 Dataset 3
F1+F2+F3+F4     6.29     19.30     20.88      10.91     21.12      0.00      7.55     16.17     17.77       9.95     21.29     25.22
F1              1.91      4.51      0.00       1.58      0.00      0.00      4.56      7.49      0.00       2.22      6.50      0.00
F1+F2           7.26     20.44     20.13      10.20     15.22      0.00      7.66     15.57     17.30       8.86     19.90     21.82
F1+F2+F3        7.35     21.11     22.33      10.20     18.10      0.00      7.53     15.41     17.22       8.84     21.37     24.47
F1+F2+F4        6.26     19.27     20.09      10.28     18.16      0.00      7.58     16.14     16.99       9.48     20.51     23.33
F1+F3           3.56     11.05     10.47       2.29      0.00      0.00      3.47     10.39      6.51       3.27      9.98      2.11
F1+F4           4.82     15.40     17.71       3.41      2.99      0.00      5.19     15.50      3.52       6.96     16.71     11.18
F2              7.21     19.72     21.22       9.62     15.20      0.00      7.57     15.56     18.77       8.44     19.04     20.13
F2+F3           7.34     21.18     21.90       9.73     16.68      0.00      7.02     15.51     17.91       9.43     21.42     24.41
F2+F3+F4        6.24     19.42     21.06      11.38     20.37      0.00      7.55     16.82     17.78       9.89     20.53     24.36
F2+F4           6.24     19.38     20.24      10.20     17.37      0.00      7.57     16.85     17.79       8.93     20.50     21.75
F3              3.18      9.58      6.43       1.91      0.00      0.00      2.06      8.93      2.82       2.53      8.99      2.82
F3+F4           4.86     16.23     17.86       4.21      5.16      0.00      6.03     14.69     11.86       6.85     17.30     13.39
F3+F4+F1        4.90     15.52     16.96       3.80      5.74      0.00      5.55     14.61     12.59       7.31     17.31     14.02
F4              4.41     15.50     17.14       4.19      2.99      0.00      5.67     14.87      1.40       6.92     16.54      8.46
PC*KS denotes Percent Correct * Kappa Statistic



