
A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF)

Khalifa University

2014

977 983

This paper presents the performance evaluation of an authorship verification technique that is based on a modified version of General Impostors (GI) [2]. The novelties of this implementation are: (1) a modified way of combining the min-max similarity scores, and (2) a relatively large set of diverse features that spans letter-level, word-level, function word-level, word shape-level, and word tag-level features. The technique ranked highly overall in the author identification task of PAN 2014.

1 Introduction

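$$
\begin{cases}
y \in [0, 0.5) & \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i \setminus \{f_u\} \text{ are different} \\
y = 0.5 & \text{if } h \text{ does not predict} \\
y \in (0.5, 1] & \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i \setminus \{f_u\} \text{ are the same}
\end{cases}
\tag{2}
$$

$$
\text{C@1} = \frac{N_c + \frac{N_c N_u}{|P|}}{|P|} \tag{3}
$$

where $|P|$ is the total number of problems, $N_c$ is the total number of problems that the model $h$ has solved correctly, and $N_u$ is the total number of problems that the model $h$ has decided not to solve.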
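As a quick numerical check of the C@1 measure above, the following Python sketch computes it from a list of scores $y$ and ground-truth labels. The function names and the boolean encoding of the ground truth (True for "same authors") are assumptions made for illustration only; they are not part of the PAN evaluation software.

```python
def count_outcomes(scores, truths):
    """Count correct and unanswered problems.

    scores: iterable of y values in [0, 1] as defined in (2).
    truths: iterable of booleans, True meaning "same authors" (assumed encoding).
    """
    n_correct = n_unanswered = 0
    for y, same in zip(scores, truths):
        if y == 0.5:
            n_unanswered += 1          # the model declined to answer
        elif (y > 0.5) == same:
            n_correct += 1             # the prediction agrees with the ground truth
    return n_correct, n_unanswered


def c_at_1(n_correct, n_unanswered, n_problems):
    """C@1 = (N_c + N_c * N_u / |P|) / |P|."""
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    if n_correct < 0 or n_unanswered < 0 or n_correct + n_unanswered > n_problems:
        raise ValueError("inconsistent counts")
    return (n_correct + n_correct * n_unanswered / n_problems) / n_problems


if __name__ == "__main__":
    scores = [0.9, 0.5, 0.2, 0.7]
    truths = [True, True, False, False]
    n_c, n_u = count_outcomes(scores, truths)
    print(c_at_1(n_c, n_u, len(scores)))
```

Note that an unanswered problem is rewarded in proportion to the accuracy on the answered ones, so leaving a problem at $y = 0.5$ costs less than answering it incorrectly.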

Additionally, in order to improve the quality of the models $h \in H$ that are found, the organizers of PAN have also provided another problem collection $L$ that has the same structure as the collection $P$. $L$ is intended to be used as a training set to find better models $h \in H$, and thus PAN has also provided its ground truth information in the form of a function $G : L \to \{\text{same authors}, \text{different authors}\}$.

2 Method description

Our classifier, A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF), is (as the name implies) based on a modified version of the General Impostors (GI) framework [2]. It differs from other GI implementations in a couple of ways. We first discuss why we chose GI as the starting point for ASGALF, and then discuss the differences between the GI implementation in [3] and ASGALF.

2.1 Reasons for adopting the GI framework

Based on our preliminary tests, previous PAN results, and our own intuition, we are convinced that the set of impostors (or distractors) carries information that helps answer the question of how similar two vectors are, and that taking this information into account in the decision-making process helps solve (1) better than ignoring it.

For example, consider two feature vectors $v_1 = k(f_1)$ and $v_2 = k(f_2)$, where $k : P_i \to \mathbb{R}^d$ is a function that extracts the features from any input text file $f_j \in P_i$, for any $j \in J$, and returns the feature vector $v_j \in \mathbb{R}^d$ that corresponds to $f_j$.

As far as we know, simply knowing the value of $q(v_1, v_2)$ is not enough in practice, where $q : \mathbb{R}^d \times \mathbb{R}^d \to Y$ is a function that returns the similarity of its input vectors.

Our justification is that the output of every $q$ function known to date depends to some extent on the topic/context of the evaluated vectors. In some topics/contexts authors in general tend to be similar to one another, while in others they tend to be dissimilar.

For example, in some contexts $q(v_1, v_2) = y$ indicates that $v_1$ and $v_2$ represent documents authored by the same individual only if $y$ is greater than (say) 0.9. This can be the case if the topic/context is a restrictive one that causes authors to be very similar to each other (e.g. reports). On the other hand, another topic/context may allow high variability among authors, so that $v_1$ and $v_2$ can be assumed to represent documents authored by the same individual even if $q(v_1, v_2) < 0.9$; for instance, $q(v_1, v_2) = 0.4$ may already indicate that the authors are the same.

Thus fetching a set of impostor text files $M = \{m_w : \forall w \in W\}$, where $W = \{1, 2, \ldots, n\}$ is the index set of $M$, is in our view a solid step towards a better optimization of problem (1).

We believe that measuring the distance against the impostor vectors $\{x_w = k(m_w) : \forall w \in W\}$ allows the model to see how close $v_1$ and $v_2$ are to each other relative to other impostor vectors in the same topic.
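To make this argument concrete, the sketch below takes $q$ to be the min-max similarity (the measure this paper reports using for $q$), i.e. the sum of element-wise minima divided by the sum of element-wise maxima, and compares $q(v_1, v_2)$ with each vector's similarity to its closest impostor. The toy vectors and the helper name `impostor_context` are hypothetical and chosen only for illustration.

```python
def min_max(a, b):
    """Min-max similarity of two non-negative feature vectors."""
    den = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / den if den > 0 else 0.0


def impostor_context(v1, v2, impostors):
    """Return q(v1, v2) and each vector's similarity to its closest impostor."""
    pair = min_max(v1, v2)
    best1 = max(min_max(v1, x) for x in impostors)
    best2 = max(min_max(v2, x) for x in impostors)
    return pair, best1, best2


if __name__ == "__main__":
    v1, v2 = [2, 0, 1], [1, 1, 1]
    restrictive = [[2, 1, 1]]            # impostors that resemble both documents
    variable = [[0, 3, 0], [0, 0, 4]]    # impostors that resemble neither
    print(impostor_context(v1, v2, restrictive))
    print(impostor_context(v1, v2, variable))
```

In both cases $q(v_1, v_2)$ is the same, yet it is lower than the impostor similarities in the first context and higher than them in the second, which is exactly the information a fixed threshold on $q$ would miss.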

Additionally, the GI framework is based on ensembling randomized models. Although it might not be obvious at first glance, we believe that GI essentially creates a set of randomized models in every run (by choosing different feature and impostor subsets in every run). Relying on ensembles of models is another strength of the GI framework, one that is appreciated by the machine learning community as well as in competitions such as the Netflix Prize¹, where most of the top techniques were composed of some form of ensemble.

2.2 Differences between ASGALF and [3]

– Instead of using the original Impostors score measure (4), we adopted a modified one as presented in (5). The advantage of (5) is that it allows us to measure how similar the input vectors are, as opposed to whether they are similar enough. On the other hand, a possible disadvantage is overfitting the collection of training problems $L$.

$$
\text{score} \leftarrow \text{score} + 1 \quad \text{if } \operatorname{min\text{-}max}(v_1, v_2)^2 > \operatorname{min\text{-}max}(v_1, x_1) \cdot \operatorname{min\text{-}max}(v_2, x_2) \tag{4}
$$

$$
\text{score} \leftarrow \text{score} + \frac{\operatorname{min\text{-}max}(v_1, v_2)^2}{\operatorname{min\text{-}max}(v_1, x_1) \cdot \operatorname{min\text{-}max}(v_2, x_2)} \tag{5}
$$

where $x_1$ is the most similar impostor to $v_1$ and $x_2$ is the most similar impostor to $v_2$.

– Using a large set of diverse features. Essentially, we extracted letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level features as follows:
  – n-grams with various combinations of $n$ values and gram types: the $n$ values are $\{1, 2, \ldots, 10\}$ and the gram types are {letters, words, function words, word shapes², POS tags³, POS-words⁴}. The resulting features were based on all combinations of $n$ values and gram types. This resulted in a large number of features, most of which were too infrequent to be reliable, so we only considered features that occurred at least 5 times in any single document. However, the number of features generally remained large even after removing the infrequent ones. The total number of extracted features after removing the infrequent ones varied depending on the language-genre combination, as shown in Table 1.
  – Body richness: the total number of unique words in a given text file $f_j$, for any $j \in J$, normalized by the total number of words in the same file $f_j$.

¹ http://en.wikipedia.org/wiki/Netflix_Prize
² The word shapes are based on three properties: character case (e.g. lower/upper case), character type (whether it is a letter or a number), and word length. For example, the word “School” is represented as the gram “Cccccc”, “2014” is represented as the gram “NNNN”, “x86” is represented as the gram “cNN”, etc.
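The feature types described above can be sketched in a few lines of Python. This is a minimal illustration and not the ASGALF feature extractor: the whitespace tokenization, the handling of characters that are neither letters nor digits, and all function names are assumptions. The word-shape mapping follows the examples given in the text (“School”, “2014”, “x86”).

```python
from collections import Counter


def word_shape(word):
    """Map a word to its shape: C = upper-case letter, c = lower-case letter, N = digit."""
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("N")
        elif ch.isalpha():
            shape.append("C" if ch.isupper() else "c")
        else:
            shape.append(ch)  # assumption: other characters are kept unchanged
    return "".join(shape)


def ngrams(items, n):
    """All contiguous n-grams of a sequence (letters, words, shapes, tags, ...)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def ngram_counts(items, n_values=range(1, 11)):
    """Count n-grams for every n in n_values (1..10 by default)."""
    counts = Counter()
    for n in n_values:
        counts.update(ngrams(items, n))
    return counts


def body_richness(words):
    """Number of unique words divided by the total number of words."""
    return len(set(words)) / len(words) if words else 0.0


def frequent_features(per_document_counts, min_count=5):
    """Keep features that occur at least min_count times in some single document."""
    keep = set()
    for counts in per_document_counts:
        keep.update(f for f, c in counts.items() if c >= min_count)
    return keep


if __name__ == "__main__":
    words = "the School opened in 2014 and the school grew".split()
    print([word_shape(w) for w in words])
    print(ngram_counts([word_shape(w) for w in words], n_values=[1, 2]).most_common(3))
    print(body_richness(words))
```

The same `ngram_counts` helper can be applied to any gram type by changing the input sequence, for example `list(text)` for letters or a list of POS tags produced by an external tagger.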

Other details of ASGALF that are not mentioned in this notebook are the same as suggested in [3]. For example, as in [3], we fetch the set of impostors of a given document from a search engine by submitting search queries of 5 randomly chosen words from the subject document, downloading the 10 highest-ranked HTML pages, stripping them of any HTML markup, and using their first 1500 words. In ASGALF, we fetched the first 1500 words of 10 impostors for each document in the training set, and then grouped them on a per language-genre basis. The output of this process is a set of impostors for each language-genre combination.

3 Parameter tuning

The parameters were tuned based on our preliminary tests against the problems in $L$ as follows:

– Score correction offsets: -0.4585, -0.62950, -0.43850, -0.24850, -0.478, and 0.56600 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles respectively. This allowed the final score to be centered around 0.5 in order to satisfy the semantics in (2).
– Total number of impostor rounds: 50, which also matches the optimal value for the Spanish set in [1].
– Total number of impostor documents: 8,614, 1,257, 2,728, 2,073, 5,347, and 3,104 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles respectively.
– Total number of randomly chosen impostors per round: 20.
– Total number of randomly chosen features per round: 40% of the total features, which also matches the optimal value for the English and Spanish sets in [3].

³ Part-of-speech tags such as NN, NNS, NNP, VB, VBD, VBG, etc. For example, if the word “school” existed in a text as a noun, then it would be represented as the gram “NN”. A comprehensive list of such tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
⁴ Combinations of words and their respective POS tags. For example, if the word “saw” was a noun then it would be represented as the gram “saw-NN”, and if the word was a verb then it would be represented as the gram “saw-VBD”.
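To show how these parameters fit together, here is a minimal Python sketch of a GI-style scoring loop that uses the modified score update (5) with 50 rounds, 20 impostors per round, and 40% of the features per round. It is not the ASGALF implementation: averaging the accumulated score over the rounds, adding the offset to that average, and skipping rounds whose denominator is zero are all assumptions made here for illustration, as are the function names.

```python
import random


def min_max(a, b):
    """Min-max similarity: sum of element-wise minima over sum of maxima."""
    den = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / den if den > 0 else 0.0


def project(v, feature_ids):
    """Keep only the selected feature positions of a vector."""
    return [v[i] for i in feature_ids]


def gi_style_score(v1, v2, impostors, rounds=50, impostors_per_round=20,
                   feature_fraction=0.4, offset=0.0, seed=None):
    """Average the update term of (5) over randomized rounds, then add an offset.

    impostors must be a non-empty list of vectors with the same length as v1.
    """
    rng = random.Random(seed)
    d = len(v1)
    n_feats = max(1, round(feature_fraction * d))
    n_imps = min(impostors_per_round, len(impostors))
    total = 0.0
    for _ in range(rounds):
        feats = rng.sample(range(d), n_feats)          # random feature subset
        chosen = [project(x, feats) for x in rng.sample(impostors, n_imps)]
        p1, p2 = project(v1, feats), project(v2, feats)
        s1 = max(min_max(p1, x) for x in chosen)       # similarity to x1
        s2 = max(min_max(p2, x) for x in chosen)       # similarity to x2
        if s1 > 0 and s2 > 0:                          # assumption: skip degenerate rounds
            total += min_max(p1, p2) ** 2 / (s1 * s2)
    return total / rounds + offset


if __name__ == "__main__":
    v = [1, 2, 3, 4, 5]
    impostors = [[5, 4, 3, 2, 1], [1, 1, 1, 1, 1]]
    print(gi_style_score(v, v, impostors, offset=-0.4585, seed=0))
```

Because the ratio in (5) is not bounded by 1, the raw average is not centered on 0.5, which is consistent with the need for a per-corpus score correction offset.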

4 Evaluation results

Evaluation results are presented in Table 2. We believe that the score of the model can improve if the following limitations are addressed:

– Our classifier is implemented such that it always attempts to predict an answer, i.e. it never outputs a score $y = 0.5$, which limits its ability to take advantage of the C@1 metric.
– The parameters were tuned in a preliminary testing phase against the training datasets. The outcome was that all language-genre combinations had similar optimal parameter values. However, a more rigorous optimization process could have revealed better language-genre-specific parameter values.
– At the core of Impostors as described in [3] is the min-max similarity measure, which is also the measure that we used as the implementation of the $q$ function. It is possible that a more sophisticated model could have made better use of our diverse set of features.
– No feature subset selection was performed other than removing features that did not occur frequently enough, according to the simplistic criterion described in the previous sections. Using sounder feature subset selection methods (e.g. information gain, wrapper methods) could reduce CPU time requirements as well as improve accuracy.

6 Conclusions

This paper describes an authorship verification classifier that is based on the General Impostors (GI) framework, except that it uses a modified method of combining the scores as well as a diverse set of features. The features were of various types, namely: letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level.

The evaluation on the testing set showed high classification accuracy in general. This also supports the view that impostor/distractor-based methodologies are indeed a step forward. Although the selection of the set of impostors/distractors is a known limitation, it seems that in practice it is not a major issue and that such a set of impostors can be obtained with relative ease (e.g. by using search engines as in [3]).

On the downside, our technique generally had a slow runtime. However, we believe that this issue can be mitigated in a number of ways, such as:

– Our code made excessive use of external commands executed automatically via the shell. Some of these external commands were themselves very slow, such as the software used to extract the part-of-speech tags. If such external dependencies are replaced by faster ones (or removed completely), the speed of the feature extraction phase can improve dramatically.
– We believe that techniques based on the GI framework can be easily distributed should multiple cores or machines be available. All GI rounds are independent of each other, so the rounds can be distributed across multiple cores or even different machines, ultimately leading to a much reduced runtime.
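The independence of GI rounds can be exploited with a standard executor pattern, sketched below in Python. The per-round function `toy_round` is a hypothetical stand-in that only returns a seeded random number; in a real system it would compute one round's score from its own random feature and impostor subsets. Nothing here reflects how ASGALF was actually run.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def toy_round(seed):
    """Hypothetical stand-in for one GI round; returns a per-round score."""
    return random.Random(seed).random()


def run_rounds(round_fn, n_rounds, executor):
    """Run independent rounds through `executor` and average their scores.

    Each round receives its own seed, so no state is shared between rounds
    and the result does not depend on the order in which workers finish.
    """
    return sum(executor.map(round_fn, range(n_rounds))) / n_rounds


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_rounds(toy_round, 50, pool))
```

For CPU-bound rounds in CPython, `concurrent.futures.ProcessPoolExecutor` can be passed in place of the thread pool, provided the round function is defined at module level so that it can be pickled.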

Acknowledgement

Thanks to Shachar Seidman for answering our questions about General Impostors (GI).

References

1. Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: Conference and Labs of the Evaluation Forum (2013)

2. Koppel, M., Seidman, S.: Automatically Identifying Pseudepigraphic Texts. In: EMNLP. pp. 1449-1454. ACL (2013)

3. Seidman, S.: Authorship Verification Using the Impostors Method. In: Conference and Labs of the Evaluation Forum (2013)