A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF)

Notebook for PAN at CLEF 2014

Mahmoud Khonji and Youssef Iraqi
Khalifa University
{mahmoud.khonji, youssef.iraqi}@ku.ac.ae

Abstract. This paper presents the performance evaluation of an authorship verification technique that is based on a modified version of General Impostors (GI) [2]. The novelties of this implementation are: 1. a modified way of combining the min-max similarity measures, and 2. a relatively large set of diverse features that spans letter-level, word-level, function word-level, word shape-level, and word tag-level features. The technique ranked high overall in the author identification task of PAN 2014.

1 Introduction

This paper presents the implementation and evaluation of our classifier for the author identification task in PAN at CLEF 2014. The PAN organizers provided us with a collection of problems P = {Pi : ∀i ∈ I}, where I = {1, 2, ..., n} is the index set of P. Each problem Pi ⊂ F for all i ∈ I, where F = {fj : ∀j ∈ J} is a set of Unicode text files and J = {1, 2, ..., m} is the index set of F. Additionally, each Pi ∈ P has the following properties:

– All fj ∈ Pi are written in only one of the following language-genre combinations: Dutch essays, Dutch reviews, English essays, English novels, Greek articles, or Spanish articles.
– Exactly one text file fu ∈ Pi, for some u ∈ J, is written by an author whose identity is unknown; the remaining files Pi − {fu} are written by authors whose identities are known.
– The cardinality |Pi − {fu}| may vary from 1 to 5 inclusive as i varies, but is (obviously) fixed for a given i.

The goal of the authorship identification problem is thus to find a mapping function h : P → Y, where Y = [0, 1] with the semantics of (2), that solves the optimization problem

\[ \operatorname*{arg\,max}_{h \in H} \; \mathrm{AUC}(\{h(P_i) : \forall P_i \in P\}) \times \mathrm{C@1}(\{h(P_i) : \forall P_i \in P\}) \tag{1} \]

where H is the set of all possible mapping functions, AUC is the area under the Receiver Operating Characteristic (ROC) curve, and C@1 is essentially a modified accuracy measure, presented in (3), that rewards models h ∈ H that know when not to decide.

\[ h(P_i) = y \in \begin{cases} [0, 0.5) & \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i - \{f_u\} \text{ are different,} \\ \{0.5\} & \text{if } h \text{ does not predict,} \\ (0.5, 1] & \text{if } h \text{ predicts that the authors of } f_u \text{ and } P_i - \{f_u\} \text{ are the same.} \end{cases} \tag{2} \]

\[ \mathrm{C@1} = \frac{N_c + N_u \cdot \frac{N_c}{|P|}}{|P|} \tag{3} \]

where |P| is the total number of problems, Nc is the total number of problems that the model h has solved correctly, and Nu is the total number of problems that the model h has declined to solve.

Additionally, in order to enhance the quality of the found models h ∈ H, the organizers of PAN also provided us with another problem collection L that has a structure identical to that of P. L is intended to be used as a training set for finding better models h ∈ H, and PAN has therefore also provided its ground truth in the form of a function G : L → {same author, different authors}.
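To make the objective concrete, here is a minimal sketch of computing C@1 as defined in (3) from a list of scores with the semantics of (2); the function name and input conventions are our own illustration, not part of the PAN evaluation software.

```python
def c_at_1(scores, truths):
    """C@1 as in (3): abstentions (y == 0.5) are credited at the
    accuracy observed on the problems the model did answer.

    scores: y values in [0, 1] with the semantics of (2)
    truths: booleans, True if both sides share the same author
    """
    n = len(scores)                                    # |P|
    n_u = sum(1 for y in scores if y == 0.5)           # abstentions
    n_c = sum(1 for y, t in zip(scores, truths)        # correct answers
              if y != 0.5 and (y > 0.5) == t)
    return (n_c + n_u * n_c / n) / n

# 3 correct decisions, 1 wrong, 1 abstention over 5 problems -> 0.72
print(c_at_1([0.9, 0.1, 0.8, 0.3, 0.5],
             [True, False, True, True, True]))
```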
2 Method description

Our classifier, namely A Slightly-modified GI-based Author-verifier with Lots of Features (ASGALF), is (as the name implies) based on a modified version of the General Impostors (GI) framework [2], and it differs from other GI implementations in two ways. We first discuss why we chose GI as the starting point for ASGALF, and then discuss the differences between the GI implementation of [3] and ASGALF.

2.1 Reasons for adopting the GI framework

Based on our preliminary tests, previous PAN results, and our intuitive reasoning, it is clear to us that the set of impostors (or distractors) does contain information that helps answer the question of how similar two vectors are, and that incorporating this information into the decision-making process yields a better solution to (1) than ignoring it. For example, consider two feature vectors v1 = k(f1) and v2 = k(f2), where k : Pi → R^d is a function that extracts features from any input text file fj ∈ Pi, for any j ∈ J, and returns a feature vector vj ∈ R^d that corresponds to fj. Simply knowing the value of q(v1, v2), where q : R^d × R^d → Y is a function that returns the similarity of its input vectors, is not enough in practice. Our justification is that the output of every q function known to date depends to some extent on the topic/context of the evaluated vectors: in some topics/contexts authors generally tend to be similar, while in others they tend to be dissimilar. For example, in some context q(v1, v2) = y may indicate that v1 and v2 represent documents authored by the same individual only if y is greater than, say, 0.9; this can be the case if the topic/context is a restrictive one that causes authors to be very similar to each other (e.g. reports). On the other hand, some other topic/context may allow high variability among authors, so that v1 and v2 can be assumed to represent documents authored by the same individual even when q < 0.9; a value as low as q = 0.4 can possibly indicate the same author.

Thus fetching a set of impostor text files M = {mw : ∀w ∈ W}, where W = {1, 2, ..., n} is the index set of M, is, in our view, a solid step towards a better optimization of problem (1). We believe that measuring the distance against the impostor vectors {xw = k(mw) : ∀w ∈ W} allows the model to see how close v1 and v2 are to each other relative to other vectors in the same topic.

Additionally, the GI framework is one that is based on ensembling randomized models. Although it might not be obvious at first glance, we believe that GI essentially creates a set of randomized models in every run (by choosing different feature and impostor subsets in every run). Relying on ensembles of models is another strength of the GI framework, one that is well appreciated by the machine learning community as well as in competitions such as the Netflix Prize¹, where most of the top techniques were composed of some form of ensemble.

¹ http://en.wikipedia.org/wiki/Netflix_Prize

2.2 Differences between ASGALF and [3]

The first difference is the score measure. Instead of using the original Impostors score update (4), we adopted the modified update presented in (5). The advantage of (5) is that it allows us to measure how similar the input vectors are, as opposed to merely whether they are similar enough. On the other hand, a possible disadvantage is overfitting the collection of training problems L.

\[ \text{score} \leftarrow \text{score} + 1 \quad \text{if} \quad \text{min-max}(v_1, v_2)^2 > \text{min-max}(v_1, x_1) \times \text{min-max}(v_2, x_2) \tag{4} \]

where x1 is the most similar impostor to v1 and x2 is the most similar impostor to v2.

\[ \text{score} \leftarrow \text{score} + \frac{\text{min-max}(v_1, v_2)^2}{\text{min-max}(v_1, x_1) \times \text{min-max}(v_2, x_2)} \tag{5} \]
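To illustrate this modified update, the following is a minimal sketch of scoring a problem over randomized GI rounds. It assumes min-max is the Ruzicka-style similarity (sum of coordinate-wise minima over sum of maxima) used by the Impostors method, non-negative feature vectors, and the round parameters given later in Section 3 (50 rounds, 40% of features, 20 impostors per round); the final averaging and all function names are our own simplifications, not the actual ASGALF code.

```python
import numpy as np

def min_max(a, b):
    """Min-max (Ruzicka) similarity; assumes non-negative vectors."""
    return np.minimum(a, b).sum() / np.maximum(a, b).sum()

def gi_score(v1, v2, impostors, rounds=50, feat_rate=0.4,
             imps_per_round=20, rng=None):
    """Accumulate the modified update (5) over randomized rounds.

    v1, v2    : feature vectors of the two documents
    impostors : 2-D array, one impostor feature vector per row
    """
    rng = rng or np.random.default_rng()
    d, score = len(v1), 0.0
    for _ in range(rounds):
        # Each round sees a random feature subset and impostor subset,
        # which is what makes GI an ensemble of randomized models.
        feats = rng.choice(d, size=int(feat_rate * d), replace=False)
        rows = rng.choice(len(impostors),
                          size=min(imps_per_round, len(impostors)),
                          replace=False)
        imps = impostors[rows][:, feats]
        a, b = v1[feats], v2[feats]
        # x1, x2: most similar impostor to each side in this round
        # (assumed nonzero, i.e. some feature overlap with impostors).
        x1 = max(min_max(a, x) for x in imps)
        x2 = max(min_max(b, x) for x in imps)
        score += min_max(a, b) ** 2 / (x1 * x2)   # update (5)
    return score / rounds   # averaged here for readability
```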
The second difference is the use of a large set of diverse features. Essentially, we extracted letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level features as follows:

• n-grams with various combinations of n values and gram types: the n values are {1, 2, ..., 10}, the gram types are {letters, words, function words, word shapes², POS tags³, POS-words⁴}, and the resulting features were based on the combinatorics of all n values and gram types. This produced a large number of features, most of which were too infrequent to be reliable, so we only considered features that occurred at least 5 times in at least one single document. Even after removing the infrequent features, the number of features generally remained large; the total number of extracted features varied with the language-genre combination as shown in Table 1.

Table 1. Total number of extracted features on a per language-genre basis.

Language  Genre     Features extracted
Dutch     Essays    19,883
Dutch     Reviews   3,312
English   Essays    70,738
English   Novels    43,931
Greek     Articles  72,944
Spanish   Articles  39,099

• Body richness: the total number of unique words in a given text file fj, for any j ∈ J, normalized by the total number of words in the same file fj.

² The word shapes are based on three properties: character case (lower/upper case), character type (whether it is a letter or a number), and word length. For example, the word “School” is represented as the gram “Cccccc”, “2014” is represented as the gram “NNNN”, “x86” is represented as the gram “cNN”, etc.
³ Part-of-speech tags such as NN, NNS, NNP, VB, VBD, VBG, etc. For example, if the word “school” occurred in a text as a noun, it would be represented as the gram “NN”. A comprehensive list of such tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
⁴ Combinations of words and their respective POS tags. For example, if the word “saw” occurred as a noun it would be represented as the gram “saw-NN”, and if it occurred as a verb it would be represented as “saw-VBD”.

Other details of ASGALF that are not mentioned in this notebook are the same as suggested in [3]. For example, similar to [3], we fetch the set of impostors of a given document from a search engine by submitting search queries of 5 randomly chosen words from the subject document, downloading the 10 highest-ranked HTML pages, stripping them of any HTML markup, and using their first 1500 words. In ASGALF, we fetched the first 1500 words of 10 impostors for each document in the training set, and then grouped them on a per language-genre basis. The output of this process is a set of impostors for each language-genre combination.
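As an illustration of the word-shape features described in footnote 2 above, the following is a minimal sketch of a shape mapping together with n-gram counting under the minimum-frequency cutoff of 5; the whitespace tokenization and function names are our own simplifications, not the actual extraction pipeline.

```python
import re
from collections import Counter

def word_shape(word):
    """Map a word to its shape: 'C'/'c' for upper/lower-case letters,
    'N' for digits; other characters are kept as-is (our assumption,
    as the paper only specifies case, character type, and length)."""
    return ''.join('N' if ch.isdigit() else
                   'C' if ch.isupper() else
                   'c' if ch.islower() else ch
                   for ch in word)

def shape_ngrams(text, n, min_count=5):
    """Count word-shape n-grams, keeping only those occurring at least
    min_count times in the document (the cutoff used to drop features
    too infrequent to be reliable)."""
    shapes = [word_shape(w) for w in re.findall(r'\S+', text)]
    grams = Counter(tuple(shapes[i:i + n])
                    for i in range(len(shapes) - n + 1))
    return {g: c for g, c in grams.items() if c >= min_count}

print(word_shape("School"), word_shape("2014"), word_shape("x86"))
# -> Cccccc NNNN cNN
```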
3 Parameter tuning

The parameters were tuned based on our preliminary tests against the problems in L, as follows:

– Score correction offsets: −0.4585, −0.62950, −0.43850, −0.24850, −0.478, and −0.56600 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles respectively. These offsets center the final score around 0.5 in order to satisfy the semantics in (2).
– Total number of Impostors rounds: 50, which also matches the optimal value for the Spanish set in [1].
– Total number of impostor documents: 8,614, 1,257, 2,728, 2,073, 5,347, and 3,104 for English essays, English novels, Dutch essays, Dutch reviews, Spanish articles, and Greek articles respectively.
– Total number of randomly chosen impostors per round: 20.
– Total number of randomly chosen features per round: 40% of the total features, which also matches the optimal value for the English and Spanish sets in [3].

4 Evaluation results

The evaluation results are presented in Table 2.

Table 2. PAN14 evaluation results. Note that Testing 1 is the early-bird evaluation dataset, and Testing 2 is the one used by PAN's rankings.

Language  Genre     Eval. set   AUC      C@1      AUC × C@1  Runtime
Dutch     Essays    Training    0.94076  0.875    0.82316    00:51:38
                    Testing 1   –        –        –          00:30:22
                    Testing 2   0.91254  0.84375  0.76996    00:58:21
Dutch     Reviews   Training    0.7258   0.67     0.48629    00:12:40
                    Testing 1   –        –        –          00:06:05
                    Testing 2   0.7364   0.65     0.47866    00:12:24
English   Essays    Training    0.5285   0.545    0.28803    09:29:28
                    Testing 1   –        –        –          04:46:23
                    Testing 2   0.59875  0.5829   0.34901    09:10:01
English   Novels    Training    0.9716   0.87     0.84529    01:02:03
                    Testing 1   –        –        –          01:09:46
                    Testing 2   0.75     0.61     0.4575     02:06:16
Greek     Articles  Training    0.5596   0.5959   0.33347    03:43:16
                    Testing 1   –        –        –          01:47:47
                    Testing 2   0.8892   0.81     0.72025    03:41:48
Spanish   Articles  Training    0.8992   0.87     0.7823     04:54:45
                    Testing 1   –        –        –          02:41:57
                    Testing 2   0.898    0.7777   0.69837    04:50:49

5 Notable limitations

We believe that the score of the model could improve if the following limitations were addressed:

– Our classifier is implemented such that it always attempts to predict an answer, i.e. it never outputs the score y = 0.5, which limits its ability to take advantage of the C@1 metric.
– The parameters were tuned in a preliminary testing phase against the training datasets. The outcome was that all language-genre combinations had similar optimal parameter values. However, a more rigorous optimization process could have revealed better language-genre-specific parameter values.
– At the core of Impostors as described in [3] is the min-max similarity measure, which is also the measure we used as the implementation of the q function. It is possible that a more sophisticated model could have made better use of our diverse set of features.
– No feature subset selection was performed other than removing features that did not occur frequently enough, according to the simplistic criterion described in the previous sections. Using sounder feature subset selection methods (e.g. Information Gain, wrapper methods, etc.) could reduce CPU time requirements as well as enhance accuracy.
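To make the first limitation and the score-correction offsets from Section 3 concrete, the following sketch centers a raw round-averaged score with a per-genre offset, clips it to [0, 1] (our assumption; the paper only states that the offsets center the score around 0.5), and adds a hypothetical abstention band that snaps near-0.5 scores to exactly 0.5 so that C@1 could reward the undecided cases; ASGALF itself never abstains.

```python
# Score-correction offsets from Section 3, keyed by language-genre.
OFFSETS = {
    ("English", "essays"):   -0.4585,
    ("English", "novels"):   -0.62950,
    ("Dutch",   "essays"):   -0.43850,
    ("Dutch",   "reviews"):  -0.24850,
    ("Spanish", "articles"): -0.478,
    ("Greek",   "articles"): -0.56600,
}

def final_score(raw, genre, abstain_band=0.0):
    """Center the raw score around 0.5 per the semantics in (2).
    abstain_band > 0 is a hypothetical extension, not part of ASGALF:
    scores within the band are reported as exactly 0.5 (no decision)."""
    y = min(max(raw + OFFSETS[genre], 0.0), 1.0)   # clip to [0, 1]
    return 0.5 if abs(y - 0.5) <= abstain_band else y

# With a band of 0.05, a weak "same author" signal becomes an abstention.
print(final_score(0.99, ("English", "essays")))                     # 0.5315
print(final_score(0.99, ("English", "essays"), abstain_band=0.05))  # 0.5
```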
6 Conclusions

This paper describes an authorship verification classifier that is based on the General Impostors (GI) framework, with the exception of using a modified method of combining the scores, as well as a diverse set of features. The features were of various types, namely: letter-level, word-level, function word-level, word shape-level, and part-of-speech tag-level.

The evaluation on the testing set showed high classification accuracy in general. This also confirms that impostor/distractor-based methodologies are indeed a step forward. Although the selection of the impostor/distractor set is a known limitation, it seems that in practice it is not a major issue, and that such a set of impostors can be obtained with relative ease (e.g. by using search engines as in [3]).

On the downside, our technique generally had a slow runtime. However, we believe that this issue can be mitigated in a number of ways:

– Our code made excessive use of automatically executed external commands via the shell, and some of these external commands were themselves very slow in general, such as the software used to extract the part-of-speech tags. If such external dependencies were replaced by faster ones (or removed entirely), the speed of the feature extraction phase could improve dramatically.
– Techniques based on the GI framework can easily be distributed should multiple cores or machines be available. This is because all GI rounds are independent of each other, allowing the runs to be distributed across multiple cores or even different machines, ultimately leading to a much reduced runtime.

Acknowledgement

Thanks to Shachar Seidman for answering our questions about General Impostors (GI).

References

1. Juola, P., Stamatatos, E.: Overview of the Author Identification Task at PAN 2013. In: Conference and Labs of the Evaluation Forum (2013)
2. Koppel, M., Seidman, S.: Automatically Identifying Pseudepigraphic Texts. In: EMNLP, pp. 1449–1454. ACL (2013)
3. Seidman, S.: Authorship Verification Using the Impostors Method. In: Conference and Labs of the Evaluation Forum (2013)