Grammar Checker Features for Author Identification and Author Profiling
Notebook for PAN at CLEF 2013

Roman Kern
Know-Center
rkern@know-center.at

Abstract. Our work on author identification and author profiling is based on the question: Can the number and the types of grammatical errors serve as indicators for a specific author or a group of people? In order to detect the grammatical errors we base our approach on the output of the open-source library LanguageTool. In the case of author identification we transform the problem into a statistical test, where an unknown document is attributed to another author when its distribution of grammatical errors deviates from the documents of a reference corpus. For author profiling we implemented an instance-based classification approach, namely a k-NN classifier, in combination with a Language Model, where a text is assigned to the age or gender group whose reference corpus contains the closest match. In the evaluation we found that in both scenarios grammatical errors perform better than the baseline and capture an aspect of writing style that is not contained in more traditional features, such as stylometric features or word n-grams.

1 Introduction

The tasks of author identification and author profiling can be seen as similar problems. Author identification is the task of finding out whether a previously unseen text document has been authored by the same person as a number of reference documents. The problem can therefore be reformulated as: Does a given text match the specific writing style of a single person? In the case of author profiling one tries to infer certain characteristics of an author from a given piece of text. Again the problem can be phrased as: Does a given text match the specific writing style of a group of people? An overview of the tasks in the context of PAN 2013 is given in [4].
In both cases one can assume that, in the general case, the content of the text cannot be seen as a reliable indicator of a match. An overview of stylometric features and the main approaches is given in [5]. Using lexical errors and syntactic errors for authorship identification has already been proposed in the past [3]. The authors state that this approach is similar, to some extent, to the way humans assess the authorship of a text document. One downside of such an approach is that tools to detect those writing errors do not deliver the necessary performance, and heavy post-processing seems unavoidable. We follow the same intuition for our approach and study the effectiveness of a contemporary grammar checking tool for authorship identification and profiling.

Figure 1. Example of a short snippet of text which contains 2 errors according to LanguageTool. For the second annotated location, LanguageTool suggests: "Consider using a past participle here: 'machined'".

2 Approach

The central component of our authorship identification and profiling system is a component to detect grammatical errors within text. Here we employ the open-source tool LanguageTool¹, which is a style and grammar checker. It works for 20 different languages and can easily be extended with additional rules. To illustrate the output of the LanguageTool library, an example is depicted in Figure 1, where two different types of errors are detected; the example is taken directly from the PAN 2013 authorship identification data-set. In addition to the features generated from the LanguageTool grammar checker, we integrated more traditional stylometric features into our system.

Author Identification. The task of author identification is transformed into a statistical test, where the input is a set of reference documents from a single author and an unknown document.
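The paper does not spell out how the checker's output is turned into features, beyond noting later that the error-type distributions are smoothed before comparison. The following is a minimal sketch, not the authors' implementation: it aggregates a list of error-rule identifiers (such as those LanguageTool reports per match) into a probability distribution over error types, assuming add-one smoothing as a stand-in for the unspecified smoothing scheme. The rule IDs in the example are illustrative.

```python
from collections import Counter

def error_distribution(matches, vocabulary):
    """Aggregate grammar-checker matches into a smoothed probability
    distribution over error types.

    matches:    list of rule identifiers, one per detected error
    vocabulary: set of all error types observed across the corpus
    """
    counts = Counter(matches)
    # Add-one (Laplace) smoothing, an assumption here: unseen error types
    # keep a small non-zero probability, so distributions stay comparable.
    total = len(matches) + len(vocabulary)
    return {rule: (counts[rule] + 1) / total for rule in vocabulary}

# Example error types (real LanguageTool rule IDs, used only as placeholders)
vocab = {"MORFOLOGIK_RULE_EN_US", "EN_A_VS_AN", "UPPERCASE_SENTENCE_START"}
dist = error_distribution(["EN_A_VS_AN", "EN_A_VS_AN"], vocab)
```

With two observed "EN_A_VS_AN" errors and three known error types, the smoothed distribution assigns 3/5 to that rule and 1/5 to each unseen rule.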
The documents are processed independently from each other, where each document is fed through the feature extraction pipeline. The pipeline consists of two stages: in the first stage a number of feature spaces are filled, and in the second stage the feature spaces of the reference documents are merged into a single meta feature space. The feature spaces of the first stage are: i) stylistic and grammatical errors, ii) basic statistics, e.g. number of lines, iii) stylometric statistics, e.g. hapax legomena, iv) stem suffixes, v) slang words, and vi) sentence structure. The last feature space is optional and not enabled by default, as the run-time increases dramatically due to the use of a sophisticated parser component, the Stanford Parser [2]. All but the first feature space have already been used for Authorship Attribution by our system [1].

The feature spaces of the reference documents are then aggregated and compared to the corresponding feature spaces of the unknown document. Out of this comparison a final meta feature space is generated. For the majority of feature spaces the binary features of the meta feature space are: i) more than minimum, ii) less than maximum, iii) within minimum and maximum, and iv) about mean, which integrates the standard deviation. For the grammatical features a more sophisticated route is taken. Here the probability distributions of the individual style and grammar error types are smoothed and pairwise compared between all documents, including the reference documents as well as the unknown document. For the comparison the Kolmogorov–Smirnov test is used.

¹ http://www.languagetool.org/

Table 1. Performance of our Authorship Identification system, where the F1 performance measure is used.

Data-Set            English  Spanish  Greek
PAN 2012 - Small     0.727      -       -
PAN 2012 - Medium    0.727      -       -
PAN 2012 - Large     0.800      -       -
PAN 2013 - Train     0.800    1.000   0.583
PAN 2013 - Test      0.533    0.560   0.500
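The four binary meta features used for most feature spaces can be sketched per feature dimension as follows. This is a reconstruction, not the original code; in particular, interpreting "about mean" as lying within one standard deviation of the reference mean is our assumption, as the text only says the feature integrates the standard deviation.

```python
from statistics import mean, stdev

def meta_features(reference_values, unknown_value):
    """Derive the four binary meta features for one feature dimension.

    reference_values: feature values observed in the known-author documents
    unknown_value:    value of the same feature in the unknown document
    """
    lo, hi = min(reference_values), max(reference_values)
    m = mean(reference_values)
    s = stdev(reference_values) if len(reference_values) > 1 else 0.0
    return {
        "more_than_min":  unknown_value > lo,
        "less_than_max":  unknown_value < hi,
        "within_min_max": lo <= unknown_value <= hi,
        # Assumption: "about mean" = within one standard deviation of the mean
        "about_mean":     abs(unknown_value - m) <= s,
    }

# A value inside the reference range triggers all four features
feats = meta_features([10.0, 12.0, 14.0], 11.0)
```

A value far outside the reference range (say 20.0 for the same references) would leave only "more than minimum" positive, pushing the final decision ratio down.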
Here the binary meta features are: i) same distribution for close matches, and ii) about the same distribution for less close matches. None of the involved thresholds have been extensively evaluated; they were set in an ad-hoc manner. For the final decision the binary features of the meta feature space are combined into the ratio |F_true| / (|F_true| + |F_false|), where F_true is the set of all meta features with a positive value and F_false the set of those with a negative value. If this ratio exceeds 0.35, the unknown document is assumed to be sufficiently similar to the reference documents.

Author Profiling. For author profiling the task is to identify the age group and the gender of the author of a given text document. For this task we combined two algorithmic approaches and two different feature types. The two algorithmic approaches are: i) Language Models, and ii) a k-NN classification algorithm. In terms of feature types we again used the output of the style and grammar checker, as well as word tri-grams. The system is built in a flexible way which allows features and algorithms to be freely combined.

In the training phase the reference corpus is processed and the Language Models and the k-NN lookup index are built. For each of the groups within the reference data-set a separate Language Model is built, which captures how often a specific feature is used within the documents associated with that group. For the k-NN classifier, a single Apache Lucene² index is built, where the user groups are stored as separate fields. When a previously unseen document is processed, the results from the Language Models and the k-NN classifier can be combined. In the case of the Language Models, for each group a score is computed by iterating over all features:

score_group = Σ_feature P(feature|group) / P(feature),

where P(feature|group) is the probability of a feature for a given group. In the case of the k-NN classifier, the index is searched using the features of the unseen document as query.
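The Language Model scoring formula above can be sketched as follows. This is a simplified stand-in for the actual system: it estimates P(feature) by pooling the counts of all groups, which the paper does not specify, and it ignores smoothing of unseen features.

```python
from collections import Counter

def group_scores(doc_features, group_counts):
    """Score each group as sum over the document's features of
    P(feature|group) / P(feature), per the formula above.

    doc_features: list of features extracted from the unseen document
    group_counts: dict mapping group name -> Counter of feature counts
    """
    # Background probability P(feature), estimated over all groups pooled
    # (an assumption; the paper does not say how P(feature) is estimated).
    background = Counter()
    for counts in group_counts.values():
        background.update(counts)
    bg_total = sum(background.values())

    scores = {}
    for group, counts in group_counts.items():
        g_total = sum(counts.values())
        score = 0.0
        for f in doc_features:
            if background[f] == 0 or g_total == 0:
                continue  # feature unseen anywhere: contributes nothing
            p_given_group = counts[f] / g_total
            p_feature = background[f] / bg_total
            score += p_given_group / p_feature
        scores[group] = score
    return scores

# Hypothetical word features for two age groups
counts = {
    "20s": Counter({"lol": 8, "career": 2}),
    "30s": Counter({"lol": 1, "career": 9}),
}
scores = group_scores(["career", "career"], counts)
```

The group with the highest score is the predicted one; a document using "career" twice scores higher for the "30s" model, where that feature is relatively over-represented.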
The top three results are then examined and the scores from the search engine are summed to give a final ranking of the groups. When more than one algorithmic approach is used, they are processed in sequence; the first approach which provides a score, instead of no result or a tie, is taken as the final decision.

3 Evaluation

To assess the performance of our system for Authorship Identification we report the performance numbers not only for the PAN 2013 data-sets, but also for three data-sets which we assembled out of the PAN 2012 data-set. In Table 1 the performance of our system for the available data-sets for the three languages is reported. To assess the performance of our system for Author Profiling, we took the PAN 2013 data-set as provided by the organisers and split it into two parts. The first part, which contains 70% of all conversations, is used for training, and the remaining conversations are used as the testing data-set. In Table 2 the performance for three selected configurations is reported.

² http://lucene.apache.org/

Table 2. Performance of our Author Profiling system on the PAN 2013 data-set for three selected configurations, where F1 is used as the performance measure.

Configuration                     Language  Age: 10s  Age: 20s  Age: 30s  Gender: Male  Gender: Female
k-NN + Trigrams (knn-tri)         English    0.263     0.543     0.701       0.613          0.605
Language Model + Grammar (lm-lt)  English    0.005     0.031     0.721       0.643          0.375
knn-tri + lm-lt (default)         English    0.266     0.527     0.700       0.618          0.603
k-NN + Trigrams (knn-tri)         Spanish    0.105     0.601     0.478       0.567          0.554
Language Model + Grammar (lm-lt)  Spanish    0.000     0.721     0.134       0.642          0.596
knn-tri + lm-lt (default)         Spanish    0.011     0.651     0.458       0.619          0.598

4 Conclusions

We studied the effectiveness of style and grammar errors for Authorship Identification and Author Profiling. To this end we built a system which combines the output of a grammar checker tool with stylometric features which have already been used for Authorship Attribution in the past.
We found that the features derived from the grammatical errors do help in such scenarios and that they capture a different aspect of the writing style than the remaining stylometric features. We also found that further tuning of our system is necessary, as the performance figures vary considerably between different data-sets. In the future we plan to further use stylistic and grammatical errors as indicators of authorship, especially as any improvement in detecting these errors will also be beneficial for our approach.

References

1. Kern, R., Klampfl, S., Zechner, M.: Vote/veto classification, ensemble clustering and sequence classification for author identification. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers (2012)
2. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL '03), pp. 423–430 (2003)
3. Koppel, M., Schler, J.: Exploiting stylistic idiosyncrasies for authorship attribution, pp. 69–72 (2003)
4. Potthast, M., Gollub, T., Hagen, M., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., Stein, B.: Overview of the 5th International Competition on Plagiarism Detection (2013)
5. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60(3), 538–556 (2009)