Process-Oriented Characteristics of an Idiolect for Authorship Attribution of Heterogeneous Texts: a Pilot Study

Tatiana Litvinova [0000-0002-6019-3700]

RusProfiling Lab, Voronezh State Pedagogical University, 86 Lenina st., Voronezh 394043, Russia
centr_rus_yaz@mail.ru

Abstract. Currently, the task of identifying the author of a typed text is approached mostly in two ways: through linguistic analysis (the stylometric approach) and through analysis of typing behavior (the keystroke dynamics approach). Studies that combine these approaches by analyzing complex, process-oriented idiolectal features, although potentially feasible, remain rare. Moreover, existing research focuses mostly on one communication task, so the question of the stability of such features in one’s idiolect remains open. The paper presents the results of a pilot study aimed at assessing the discriminative ability of two groups of idiolectal markers (non-sequential and sequential process-oriented markers), both separately and in combination, in a dataset of heterogeneous texts (dialogues and monologues of different genres) produced by the same writers, whose typing process was video-recorded and afterwards manually annotated. The analysis was conducted using state-of-the-art methods for identifying and visualizing structure in multivariate data that are widely applied in “omics” studies (multilevel PCA, PLS-DA, DIABLO); such studies have a lot in common with idiolectal studies (a small sample size, a large number of highly correlated predictors). The analysis has shown that, despite a strong text-type effect, process-oriented features could discriminate between authors of a short typed text. Further studies on a larger dataset are needed to develop and test new sets of process-oriented idiolectal markers, which will contribute to our understanding of an idiolect and enhance the development of new methods for idiolect identification.
Keywords: Authorship Attribution, Corpus Linguistics, Linguistic Resource, Idiolect, Typing Behavior, Multivariate Data Analysis.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Authorship attribution (AA), i.e. the problem of attributing a text to its author, is a task of considerable practical importance. In recent years, this problem has been tackled mostly as a text classification problem and as a subtask of the user identification problem. Different types of linguistic markers have been used as features (word and part-of-speech statistics, text length, etc.) [7]. Since the main object of modern AA is a typed text, another line of research studies the speed and rhythm of typing (keystroke dynamics, KD) [9]. Comparative studies of stylometric and typing features for AA report significantly higher accuracies for KD-based models, even though these models are two orders of magnitude smaller [15] (see also [17] for similar conclusions on the superiority of KD features over stylometric markers in the AA task). The disadvantage of KD-based features is a lack of interpretability, since no linguistic information is typically used in such analysis. Moreover, as a rule, studies on KD are based on texts produced in the same communication situation (single-domain user identification) and do not consider the stability of idiolectal markers in a cross-domain scenario; the rare exceptions show that KD markers are sensitive to small differences in writing tasks [4]. The same is true for stylometric features: most of them have been shown to be domain dependent [10], which explains the extreme difficulty of cross-domain AA.

Despite decades of research, AA remains challenging, especially in a typical forensic scenario characterized by a small number of samples (texts), heterogeneity of training and testing documents in terms of genre, topic, mode, etc. [14], and a demand for interpretability of the results. New types of idiolectal markers that combine typing data and linguistic information are therefore urgently needed, as is research on their stability under different factors of intraidiolectal variation (psychological state of the author, genre of the text, mode, etc.). This paper aims to make the first steps in this direction.

The contribution of the paper is as follows.

1. The first resource for studying process-oriented idiolectal features in Russian typed texts is introduced.
2. State-of-the-art methods from bioinformatics, designed for multivariate data analysis involving a small sample size and a large number of highly correlated predictors, are applied for the first time to the AA task and to the related elimination of the text-type effect.
3. For the first time, the discriminative ability of process-oriented idiolectal features is examined for the task of AA of highly heterogeneous texts.

The rest of the paper is organized as follows. Section 2 briefly outlines related work on process-oriented idiolectal features and their usefulness in idiolectal research. Section 3 describes the methodology of the study: first, the research corpus and the process of its compilation and annotation; second, the two types of process-oriented idiolectal features; third, the methods of multivariate data analysis. Section 4 presents the results of the pilot study on AA of highly heterogeneous texts based on process-oriented idiolectal features. Section 5 draws conclusions and outlines directions for future work.

2 Related Work

Although many works deal with AA using stylometry and keystroke information separately, only a few use process-oriented idiolectal features. Hybrid, process-oriented features for AA were introduced for the first time in [14].
They focused on features based on bursts, i.e. cognitive units of text production between two pauses of fixed length. Their best-performing burst feature subset had 72 features comprising word creation, lexical complexity, revising style, keyboard proficiency, and pre- and post-pause features. No further linguistic analysis of the features is provided, though. Later the same team applied “language production” features to classify authors by group (authorship profiling) [2]. This time the feature set included part-of-speech (POS) pauses (the mean length of a pause before and after each word as represented by its part of speech), the pause time before and after punctuation marks, misspelling pauses, revision features and typing burst features. One interesting implication of the study is that some differences between female and male texts were revealed in language production and keystroke dynamics, but not in traditional stylometric indicators.

Overall, the use of process-oriented features in AA, despite having been shown to be promising, is heavily understudied, particularly in a cross-domain scenario. This is partially explained by the difficulty of collecting appropriate datasets. To the best of our knowledge, to date there is only one publicly available dataset with annotation for writing events (namely revisions) [5].

3 Methodology

3.1 Dataset

To study the process-oriented idiolectal markers and their usefulness for AA, the dataset “Multifactor”, created in the Corpus Idiolectology Lab, was used. This dataset contains highly heterogeneous texts produced by the same authors. Texts differ in mode (oral and written), type (dialogues and monologues), genre and topic. Along with the texts, the dataset contains sound files, transcriptions of oral texts manually annotated for elementary discourse units (EDUs), and video files containing screen recordings of written text production.
The dataset consists of 242 texts (2 oral dialogues, 3 written dialogues, 3 oral monologues and 3 written monologues per author) by 22 authors (11 females) aged 18–24. The dataset allows us to conduct research on the stability of idiolectal markers in different domains.

Currently, the corpus is available on request, and work on preparing it for public use is ongoing.

For this particular study, three authors were selected randomly from the Multifactor corpus (ID 1, female; ID 2, female; ID 3, male). Each author contributed 3 dialogues (with different interlocutors; topic: “Guess the movie from its description”) and 3 monologue texts (a letter of complaint, an essay, a short movie retelling). Manual annotation of the timing of text production was performed by two linguists using ELAN software [6]. All pauses (periods with no action longer than 200 ms) were identified on the time scale, as well as all actions (writing events between pauses) (see Fig. 1).

After the ELAN timing annotation of each event, manual annotation was done as follows. The pauses were classified by their duration in seconds: PAUSE 1 (< 0.49), PAUSE 2 ([0.5; 0.99]), PAUSE 3 ([1; 1.99]), PAUSE 4 ([2; 5]), PAUSE 5 (> 5). These thresholds were selected based on the writing research literature (see [12] and references therein). “Full” words (as well as words with one or two misspellings that still allow the POS to be detected unambiguously) were supplied with their POS tags; punctuation marks were assigned their labels (PERIOD, COMMA, DASH, QM (for question mark), EM (for exclamation mark)); deleted words were replaced with the DEL tag, parts of words with the PART tag, and corrections with the CORR tag. For dialogues, additional tags were used: BREAK for the start of a new line within one turn, and TURN for the end of a turn.

Fig. 1. ELAN annotation window

An example of annotation of the text “Привет/ ды ниче / валяюсь / а ты / ?” (“Hi / I’m ok / chilling / what about you / ?”), where a slash marks the end of a line, is shown below (Table 1).

Table 1. An example of text annotation

Event ID  Beginning time  Ending time   Duration      ELAN     Annotation
1         00:00:00.000    00:00:04.440  00:00:04.440  Pause    PAUSE4
2         00:00:04.440    00:00:06.110  00:00:01.670  Привет   INT
3         00:00:06.110    00:00:13.260  00:00:07.150  Pause    TURN
4         00:00:13.260    00:00:13.630  00:00:00.370  ды       PARTICLE
5         00:00:13.630    00:00:13.970  00:00:00.340  Pause    PAUSE1
6         00:00:13.970    00:00:14.380  00:00:00.410  ни       PART
7         00:00:14.380    00:00:15.430  00:00:01.050  Pause    PAUSE2
8         00:00:15.430    00:00:15.635  00:00:00.205  ч        PART
9         00:00:15.635    00:00:16.845  00:00:01.210  Pause    PAUSE3
10        00:00:16.845    00:00:17.000  00:00:00.155  е        PART
11        00:00:17.000    00:00:19.300  00:00:02.300  Pause    BREAK
12        00:00:19.300    00:00:20.795  00:00:01.495  валяюм   VERB
13        00:00:20.795    00:00:21.645  00:00:00.850  Pause    PAUSE2
14        00:00:21.645    00:00:23.195  00:00:01.550  в̶а̶л̶я̶ю̶м̶   DEL
15        00:00:23.195    00:00:25.370  00:00:02.175  Pause    BREAK
16        00:00:25.370    00:00:26.880  00:00:01.510  валяюсь  VERB
17        00:00:26.880    00:00:28.195  00:00:01.315  Pause    BREAK
18        00:00:28.195    00:00:28.355  00:00:00.160  а        PARTICLE
19        00:00:28.355    00:00:28.730  00:00:00.375  Pause    PAUSE1
20        00:00:28.730    00:00:29.170  00:00:00.440  ты       PRON

To avoid the effect of differences in text length and total timing between authors, which could skew our results, we restricted ourselves to the first 400 events of each text. The mean length of the final texts produced during these 400 events was 147 words (SD = 22) for dialogues and 123 words (SD = 30) for monologues, the latter being shorter than the former (paired t-test, p = 0.0008). We calculated the mean time spent in each POS tag, DEL and PART by summing the durations of these states in a text and dividing the sum by the number of such states in that text.
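The pause classes and the per-state mean times described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the study (where the annotation was manual): the function names, the (label, duration) event representation and the way the gaps between the published intervals are closed (for example, between 0.49 s and 0.5 s) are assumptions.

```python
from collections import defaultdict


def classify_pause(duration_s):
    """Map a pause duration in seconds to one of the five pause classes.

    Assumption: the published intervals leave small gaps (e.g. 0.49-0.5 s),
    so each class is treated here as extending up to the next lower bound.
    """
    if duration_s < 0.5:
        return "PAUSE1"
    if duration_s < 1.0:
        return "PAUSE2"
    if duration_s < 2.0:
        return "PAUSE3"
    if duration_s <= 5.0:
        return "PAUSE4"
    return "PAUSE5"


def mean_time_per_state(events):
    """Return {label: mean duration} for an iterable of (label, duration_s).

    The mean is the summed duration of a state divided by the number of
    occurrences of that state in the text.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, duration in events:
        totals[label] += duration
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}


if __name__ == "__main__":
    # Hypothetical events, loosely modeled on the tags used in the annotation.
    events = [("PART", 0.5), ("PART", 0.25), ("VERB", 1.5)]
    print(classify_pause(4.44))
    print(mean_time_per_state(events))
```

A dictionary keyed by tag keeps the sketch independent of the exact tag inventory, so the same function would serve for POS tags, DEL and PART alike.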
The normality of the data distribution was tested using the Shapiro–Wilk test but was not confirmed. The non-parametric Wilcoxon signed-rank test with FDR correction showed no differences between dialogues and monologues in the mean time spent in each state except for PART (p.adjust = 0.03906): in dialogues the mean time spent in PART is higher.

3.2 Feature description

Two types of process-oriented idiolectal markers were used in this study. The first group of markers captures non-sequential characteristics of the text production process. We call them general production features (GPF).

1. General production features (GPF):

• Pure_Words = total words / (total time – pause time – part_del_corr). A measure of author productivity expressing the ratio of words to time spent in productive writing.
• Corr_words = part_del_corr / total words. The ratio of revision events to total words.
• WL (word length) = total char / total words. The mean word length in characters.
• PM_PAUSE: the ratio of time spent in punctuation marks to time spent in pauses.
• WL_Time_thinking: the ratio of the mean word length to time spent in editing and pausing.
• VERB_DEL: the ratio of the number of verbs to the number of deletions.
• PM_DEL: the ratio of the number of punctuation marks to the number of deletions.
• NOUN_PART, VERB_PART, ADV_PART, PREP_PART, PM_PART: the ratio of the number of nouns, verbs, adverbs, prepositions and punctuation marks, respectively, to the number of PART events.
• PREP_CONJ: the ratio of the total time spent in prepositions to the total time spent in conjunctions.
• PART_time, CONJ_time, VERB_time, ADV_time, COMMA_time, PREP_time, DEL_time: total time spent in PART events, conjunctions, verbs, adverbs, commas, prepositions and DEL events, respectively.

2. Sequential features (BIGRAM):

Sequential features, i.e. the frequencies of all pairs of adjacent events (bigrams), were calculated provided that one of the two events is a PAUSE event, which resulted in 164 features.
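The two feature groups can be illustrated with a short Python sketch. It is not the code used in the study, and several details are assumptions made for the sake of the example: the function and parameter names are hypothetical; part_del_corr is read as time in the denominator of Pure_Words and as a count of revision events in Corr_words; WL is computed as characters per word, in line with its description as the mean word length in characters; and only tags beginning with "PAUSE" are treated as pause events when forming bigrams.

```python
from collections import Counter


def general_production_features(total_words, total_chars, total_time,
                                pause_time, revision_time, revision_events):
    """Compute three of the GPF ratios from pre-aggregated totals.

    Assumptions: `revision_time` is the time spent in PART, DEL and CORR
    events; `revision_events` is their count. Times are in seconds.
    """
    productive_time = total_time - pause_time - revision_time
    return {
        "Pure_Words": total_words / productive_time,
        "Corr_words": revision_events / total_words,
        "WL": total_chars / total_words,
    }


def pause_bigrams(tags):
    """Count adjacent tag pairs in which at least one member is a pause.

    Feature names follow the left_right pattern seen in the paper's figures
    (e.g. pause1_noun, del_pause4).
    """
    counts = Counter()
    for left, right in zip(tags, tags[1:]):
        if left.startswith("PAUSE") or right.startswith("PAUSE"):
            counts[f"{left}_{right}".lower()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical totals and a tag sequence modeled on the annotation example.
    print(general_production_features(100, 500, 300.0, 150.0, 50.0, 20))
    tags = ["PAUSE4", "INT", "TURN", "PARTICLE",
            "PAUSE1", "PART", "PAUSE2", "PART"]
    print(pause_bigrams(tags))
```

Working from pre-aggregated totals avoids committing to a particular rule for what counts as a word, which the text does not spell out.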
A pre-filtering step was applied to remove bigrams whose total count, as a share of the total sum of all counts, fell below a threshold (we chose a 1 % cut-off)¹. Thus, the original feature set was reduced to 64 features. Prior to the main analysis of each feature set, we applied Hellinger standardization, in which each element is divided by its row sum and the square root of the result is then taken. This type of standardization was selected since it gives low weights to variables with low counts and many zeros [3].

¹ The function for prefiltering was borrowed from http://mixomics.org/mixmc/mixmc-pre-processing/ (accessed 25/10/2020).

3.3 Methods

For this pilot study, different methods for identifying structure in multivariate data were used as implemented in the R package mixOmics (for the sake of brevity, we do not describe the methods here and refer the reader to [16]). Namely, principal component analysis (PCA) and multilevel PCA (PCA with extraction of the within-variation matrix) were used to visualize the variation of idiolectal markers according to text type (monologue and dialogue) and author. The multilevel function first separates the within (text type) from the between (author) variance in the data sets. This is crucial for our task, since we hypothesize a strong text-type effect in our data which could complicate AA. To complement the PCA results, we also applied variation partitioning analysis using the R package vegan (see [18] for more details).

Partial Least Squares Discriminant Analysis (PLS-DA) was used for text class (author ID) prediction.
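The preprocessing steps mentioned in this section (low-count prefiltering, Hellinger standardization and removal of a grouping effect before PCA or PLS-DA) can be sketched in plain Python as follows. This is a simplified illustration, not the mixOmics implementation: the function names are hypothetical, and the last function merely centers each feature within the groups whose effect is to be removed (here assumed to be text type); how the grouping was actually passed to the multilevel functions is not stated in the text.

```python
import math


def prefilter_columns(rows, percent=1.0):
    """Keep columns whose summed count exceeds `percent` % of the grand total.

    Returns the filtered rows and the indices of the retained columns.
    """
    col_sums = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_sums)
    keep = [j for j, s in enumerate(col_sums)
            if s * 100.0 / grand_total > percent]
    return [[row[j] for j in keep] for row in rows], keep


def hellinger(rows):
    """Divide each element by its row sum, then take the square root."""
    out = []
    for row in rows:
        total = sum(row)
        out.append([math.sqrt(v / total) if total else 0.0 for v in row])
    return out


def center_within_groups(rows, groups):
    """Subtract from each row the feature-wise mean of its group.

    Assumption: `groups` holds the factor whose effect should be removed.
    """
    sums, counts = {}, {}
    for row, g in zip(rows, groups):
        acc = sums.setdefault(g, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[g] = counts.get(g, 0) + 1
    means = {g: [s / counts[g] for s in sums[g]] for g in sums}
    return [[v - m for v, m in zip(row, means[g])]
            for row, g in zip(rows, groups)]


if __name__ == "__main__":
    # Hypothetical bigram counts for two texts and three features.
    counts = [[50, 49, 1], [50, 50, 0]]
    filtered, kept = prefilter_columns(counts, percent=1.0)
    transformed = hellinger(filtered)
    print(kept, center_within_groups(transformed, ["D", "M"]))
```

After the Hellinger step the squared values in each row sum to one, which is why rare features with many zeros receive little weight.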
Stratified 2-fold cross-validation as implemented in mixOmics (since the mixOmics package only allows ‘number of classes – 1’ folds), 6-fold (default value) cross-validation as implemented in the R package RVAideMemoire [8], and leave-one-out cross-validation were applied to assess the performance of the PLS-DA models; since the results were similar, we report only the results of 6-fold cross-validation. We also performed permutation tests to assess the significance of the PLS-DA models using the R package RVAideMemoire. The class labels of the samples were scrambled 999 times and the models re-calculated to establish the likelihood of achieving the same result by chance. Then, the dataset integration method DIABLO was used to assess the predictive ability of the combination of the two feature sets with respect to the outcome (Author). This type of N-integration analysis (i.e. analysis suitable for research where several feature sets measured on the same individuals, the same N, are available, which is the case for most idiolectal studies), developed by the mixOmics team, aims to identify a highly correlated multi-dataset signature discriminating the samples (texts) according to their group (author).

mixOmics is a well-documented R package which allows one to perform different types of state-of-the-art analysis on multivariate data, with a special focus on variable selection and data visualization. Originally designed for “omics” data (large-scale biological datasets), these methods are suitable for idiolect studies, since idiolectal data and idiolect identification tasks have a lot in common with omics data and related problems: multicollinearity and a large number of predictors (the p > N problem), a small number of samples, and strong connections between different datasets (linguistic levels of an idiolect) describing the same individuals.
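The logic of the label-permutation test can be illustrated with a small self-contained Python sketch. It does not reproduce PLS-DA or the RVAideMemoire procedure: a leave-one-out nearest-centroid classifier is swapped in as a simple stand-in, and all function names, the data and the seed are hypothetical. Only the overall scheme (compute the observed error rate, scramble the labels 999 times, count how often the scrambled labels do at least as well) follows the description above.

```python
import random


def loo_nearest_centroid_error(rows, labels):
    """Leave-one-out error rate of a nearest-centroid classifier.

    Stand-in for PLS-DA, used only to keep the sketch self-contained.
    """
    errors = 0
    for i, (row, true_label) in enumerate(zip(rows, labels)):
        sums, counts = {}, {}
        for k, (other, lab) in enumerate(zip(rows, labels)):
            if k == i:
                continue  # hold out the sample being classified
            acc = sums.setdefault(lab, [0.0] * len(other))
            for j, v in enumerate(other):
                acc[j] += v
            counts[lab] = counts.get(lab, 0) + 1

        def sq_dist(lab):
            return sum((v - s / counts[lab]) ** 2
                       for v, s in zip(row, sums[lab]))

        predicted = min(sums, key=sq_dist)
        errors += predicted != true_label
    return errors / len(rows)


def permutation_p_value(rows, labels, error_fn, n_perm=999, seed=0):
    """Return the observed error rate and its permutation p-value."""
    rng = random.Random(seed)
    observed = error_fn(rows, labels)
    shuffled = list(labels)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if error_fn(rows, shuffled) <= observed:
            as_good += 1
    # Add one to numerator and denominator to count the observed labeling.
    return observed, (as_good + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical two-feature data: three well-separated "authors".
    rows = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
            [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
            [10.0, 0.0], [10.1, 0.0], [10.0, 0.1]]
    labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]
    print(permutation_p_value(rows, labels, loo_nearest_centroid_error))
```

With 999 permutations the smallest attainable p-value is 0.001, which matches the resolution of the p-values reported in the paper.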
Moreover, using the above-mentioned methods, with their strong focus on variable importance, identification of the sources of variability and data visualization, could shift the paradigm in AA studies, most of which currently take a purely engineering approach focused on prediction rather than interpretability.

4 Results

First, we performed an experiment with general production features (GPF). The broken-stick model [18] has shown that the first two components are worth interpreting. They account for 65 % of variation. As Fig. 2 shows, there is a clear tendency towards text type-based clustering (dialogue (denoted as D_) versus monologue (M_)) along component 1, which explains the largest part of variation (49 %). Variation partitioning analysis has confirmed the PCA finding that text type is the main source of variation in our data. Adj.R.squared for the factor “Text type” is 0.24 (i.e. text type explains 24 % of variance in our data, out of the 65 % that could possibly be explained by the first 2 components), while for “Author” it is only 0.06 (i.e. author explains only 6 % of variance in the data). A permutation test has shown that the results of the analysis are significant for the model with two factors (p = 0.002), for the model with text type as the main factor and author as covariate (p = 0.001), and for the model with author as the main factor and text type as covariate (p = 0.027).

[Plot: texts on PC1 (49% expl. var) vs PC2 (16% expl. var); legend: authors 1, 2, 3]

Fig. 2. PCA on GPF feature set

However, when we applied multilevel PCA for text-type effect elimination, a tendency towards author-based clustering was revealed (Fig. 3).

[Plot: texts labeled Mono/Dial on PC1 (27% expl. var) vs PC2 (16% expl. var); legend: authors 1, 2, 3]

Fig. 3. Multilevel PCA with genre effect elimination for the GPF feature set

Having revealed the main sources of variation in our data using an unsupervised approach, we next move on to a classification experiment to assess the predictive ability of the variables. As expected, the PLS-DA error rate on data without variance decomposition is high (0.42778), although the model is still significant (permutation test p-value = 0.034). Pairwise permutation tests with FDR correction revealed that all authors differ from one another (all p < 0.05). However, when we constructed PLS-DA on data with the text-type effect eliminated, the error rate decreased to 0.175. The permutation test has shown that the results are statistically significant (p = 0.001; all pairwise permutation tests p < 0.05).

To visualize the results of the PLS-DA model, we used color-coded Clustered Image Maps (CIMs) ("heat maps"). A CIM is a 2-dimensional visualization of a real-valued matrix with rows and columns reordered according to a hierarchical clustering method (we used the Ward method and Euclidean distance) to identify patterns of interest in the data (i.e. simultaneous clustering of samples and variables). The CIM (Fig. 4) shows 3 clear authorial groups (texts are rows) as well as 2 large clusters of authorial markers (variables are columns): one cluster contains variables expressing general productivity; the second cluster consists of variables reflecting various particular characteristics of the writing process and is further divided into two smaller clusters.

[Heat map: rows are texts grouped by author (1, 2, 3), columns are GPF variables; color key from -3.61 to 3.61]

Fig. 4. Clustered Image Map on multilevel PLS-DA, GPF set

The broken-stick criterion for PCA with BIGRAM features shows that the first 2 components are meaningful; they account for 45 % of variation. Fig. 5 shows that text type is the main source of variation. Multilevel PCA eliminates the text-type effect: clear authorial clusters are revealed (Fig. 6).

[Plot: texts on PC1 (27% expl. var) vs PC2 (18% expl. var); legend: authors 1, 2, 3]

Fig. 5. PCA on the BIGRAM feature set

[Plot: texts on PC1 (22% expl. var) vs PC2 (19% expl. var); legend: authors 1, 2, 3]

Fig. 6. Multilevel PCA on the BIGRAM feature set

Variation partitioning analysis shows that both factors contribute to variation: Adj.R.squared for the factor “Text type” is 0.14 (i.e. it explains 14 % of variance in our data, out of the 45 % that could possibly be explained by the first 2 components), while Adj.R.squared for “Author” is 0.12 (i.e. it accounts for 12 % of variance in the data). The permutation test has shown that the results of the analysis are significant both for the model with two factors (p = 0.001) and for the models with one factor as the main factor and the other as covariate (p = 0.001 for both models).

PLS-DA performed without variance decomposition shows ER = 0.15 (p-value = 0.001); the ER of the model on data with variance decomposition is 0.086111 (p-value = 0.001). For both models, p < 0.05 in pairwise comparisons, except for one pair in the model without variance decomposition (p = 0.056 for the pair “Author 1 – Author 2”).

The CIM revealed 3 author-based clusters with 1 incorrectly classified text by Author 1 (Fig. 7) as well as 2 large clusters of features.
One cluster consists of the features reflecting pause behavior in revisions; the other contains features reflecting particular characteristics of pausing behavior.

[Heat map: rows are texts grouped by author (1, 2, 3), columns are bigram variables; color key from -2.74 to 2.74]

Fig. 7. CIM on multilevel PLS-DA performed on BIGRAM feature set

Next we move on to feature set integration (two-block DIABLO). The permutation test based on cross-validation confirmed the significance of the DIABLO model (ER = 0.14167, p-value = 0.001). The first components from each data set are highly correlated with each other (0.81); the same is true for the second components (0.82). Fig. 8 displays the variables from the two blocks selected on components 1 and 2 (cut-off = 0.5). The clusters of points indicate a strong correlation between the variables.
[Correlation circle plot: variables from the bigram and GPF blocks on Component 1 vs Component 2]

Fig. 8. Correlation circle for two feature sets

The CIM on the DIABLO model (Fig. 9) allows us to see the signature of each author over the two feature sets. Irrespective of text type, Author 1 is characterized by higher values of NOUN_PART, PREP_CONJ, WL_Thinking, PM_DEL, prep_pause1, prep_pause2, PREP_time, VERB_DEL, pause3_prep, pause1_adv, ADV_time, ADV_PART, part_pause2, verb_pause1, pause2_corr. Overall, Author 1 typically spends more typing time in prepositions, adverbs and verbs, and less time in revisions.

Author 2 is characterized by higher values of Corr_words, WL, PART_time, DEL_time, del_pause1, pause1_part, adj_pause1, PM_PAUSE, comma_pause1, pause1_comma, pause2_del, del_pause4. Most of the features which characterize this author are related to general productivity (she spends more time in revisions, although she has higher word complexity as assessed by word length) and punctuation behavior. The only POS feature is the bigram of an adjective and a short pause.

Author 3 is characterized by a higher mean time spent in CONJ and higher values of PREP_PART, VERB_PART, VERB_time, PM_PART, verb_pause3, pause2_prep, pause2_verb, pause3_pron, Pure_Words, pause1_del, corr_pause2, del_pause2, del_pause3, pause3_noun, pause2_comma.
Author 3 is characterized by a larger amount of time spent in conjunctions and verbs and less time spent in PART, i.e. a higher speed of word retrieval.

[Heat map: rows are variables from the bigram and GPF blocks, columns are texts grouped by author (1, 2, 3); color key from -2 to 2]

Fig. 9. CIM on DIABLO model

Here again we see two large clusters of variables, one of them expressing general characteristics of productivity (at the bottom of Fig. 9), while the second one expresses particular characteristics of the writing process.

5 Conclusions and Future Work

We have performed a pilot study into the stability and discriminative ability of a process-oriented set of idiolectal markers, which combine information from stylometry and keystroke dynamics, in a very complicated, heterogeneous authorship attribution scenario (the dataset contains both dialogue and monologue texts).

Based on the performed experiments, we have arrived at the following conclusions:

1. Process-oriented idiolectal markers could be used to detect authors in a highly heterogeneous dataset.
2. Process-oriented idiolectal markers reflecting information on the duration of pauses before and after words with a known morphological class (POS), as well as before and after writing process events related to revision, are better predictors of author than process-oriented idiolectal markers expressing non-sequential information.
3. State-of-the-art methods of analysis and visualization of multivariate data, developed originally for large biological datasets, are suitable for analyzing idiolects, since biological and linguistic data have a lot in common.
4. The methods for variance decomposition allow one to eliminate a strong text-type effect and could be used in cross-domain authorship attribution, which is the most complicated type of AA task.

As with every pilot study, the presented results cannot be generalized; however, they clearly point to directions for future work.

1. It is necessary to expand the corpus of texts annotated for a process-oriented set of idiolectal markers. To do this in a more efficient and less time-consuming manner, it is necessary to develop methods which could extract these features automatically.
2. It is important to broaden the set of process-oriented idiolectal markers and the range of communication tasks the authors are involved in.
3. One promising direction for future research is to assess the effect of the medium of text production (pen, physical keyboard, touch screen keyboard) on the stability and discriminative ability of process-oriented idiolectal markers.

Acknowledgement

This work is supported by grant No. 18-78-10081 from the Russian Science Foundation, which is gratefully acknowledged.

References

1. Belman, A. K. et al.: Insights from BB-MAS – A Large Dataset for Typing, Gait and Swipes of the Same Person on Desktop, Tablet and Phone. arXiv:1912.02736 (2019).
2. Brizan, D. G., Goodkind, A., Koch, P., Balagani, K., Phoha, V. V., Rosenberg, A.: Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics. In: International Journal of Human-Computer Studies, 82, 57–68 (2015).
3. Buttigieg, P. L., Ramette, A.: A Guide to Statistical Analysis in Microbial Ecology: a community-focused, living review of multivariate data analyses. In: FEMS Microbiol Ecol., 90, 543–550 (2014).
4. Conijn, R., Jens, R., van Zaanen, M.: Understanding the keystroke log: the effect of writing task on keystroke features. In: Reading and Writing, 32(9), 2353–2374 (2019).
5. Conijn, R., Speltz, E. D., van Zaanen, M., van Waes, L., Chukharev-Hudilainen, E.: A product and process oriented tagset for revisions in writing. In: Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 363–368 (2020).
6. ELAN (Version 5.9). The Language Archive, https://archive.mpi.nl/tla/elan, last accessed 2020/03/01.
7. Grieve, J.: Quantitative authorship attribution: an evaluation of techniques. Literary and Linguistic Computing, 22(3), 251–270 (2007).
8. RVAideMemoire: testing and plotting procedures for biostatistics. R Package Version 0.9-69, https://CRAN.R-project.org/package=RVAideMemoire, last accessed 2020/02/17.
9. Kostyuchenko, E., Kadyrov, R.: Influence of the System Parameters on the Final Accuracy for the User Identification by Free-Text Keystroke Dynamics. In: Bhatia, S., Tiwari, S., Mishra, K., Trivedi, M. (eds.) Advances in Computer Communication and Computational Sciences. Advances in Intelligent Systems and Computing, 759. Springer, Singapore (2019).
10. Litvinova, T.: Stylometrics Features Under Domain Shift: Do They Really “Context-Independent”? In: Karpov, A., Potapova, R. (eds.) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, 12335. Springer, Cham (2020).
11. Locklear, H., et al.: Continuous authentication with cognition-centric text production and revision features. In: IEEE International Joint Conference on Biometrics, 1–8. Clearwater, FL (2014).
12. Medimorec, S., Risko, E. F.: Pauses in written composition: on the importance of where writers pause. In: Reading and Writing, 30, 1267–1285 (2017).
13. Monaco, J., Stewart, J., Cha, S.-H., Tappert, C.: Behavioral biometric verification of student identity in online course assessment and authentication of authors in literary works. In: 2013 IEEE Sixth Intl. Conf. on Biometrics: Theory, Applications and Systems, 1–8 (2013).
14. Panicheva, P., Litvinova, T.: Authorship Attribution in Russian in Real-World Forensics Scenario. In: Martín-Vide, C., Purver, M., Pollak, S. (eds.) Statistical Language and Speech Processing. SLSP 2019. Lecture Notes in Computer Science, 11816. Springer, Cham (2019).
15. Plank, B.: Predicting authorship and author traits from keystroke dynamics. In: Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, 98–104 (2018).
16. Rohart, F., Gautier, B., Singh, A., Lê Cao, K.-A.: mixOmics: an R package for ‘omics’ feature selection and multiple data integration. PLoS Comput Biol, 13(11) (2017).
17. Stewart, J. C., Monaco, J. V., Cha, S. H., Tappert, C. C.: An investigation of keystroke and stylometry traits for authenticating online test takers. In: 2011 International Joint Conference on Biometrics (IJCB), 1–7 (2011).
18. Analysis of community ecology data in R, https://www.davidzeleny.net/anadat-r/doku.php/en:start, last accessed 2020/01/20.