matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating

Matteo Brivio
University of Tübingen, Department of Linguistics
matteo.brivio@student.uni-tuebingen.de

Abstract

English. This paper describes our contribution to the EVALITA 2020 shared task DaDoEval – Dating Document Evaluation. The solution we present is based on a linear multi-class Support Vector Machine classifier trained on a combination of character and word n-grams, as well as the number of word tokens per document. Despite its simplicity, the system ranked first both in the coarse-grained classification task on same-genre data and in the one on cross-genre data, achieving macro-average F1 scores of 0.934 and 0.413, respectively. The system implementation is available at https://github.com/matteobrv/DaDoEval.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Temporal information, such as the publication date of a document, is of major relevance in a number of domains, like historical linguistics and digital humanities (Niculae et al., 2014). This is arguably even more true for a wide range of information retrieval tasks, such as document exploration, similarity search, summarisation and clustering, where the temporal dimension plays a major role in improving search results (Alonso et al., 2007; Alonso et al., 2011).

Such information, however, is not always readily available and must therefore be inferred, relying either on qualitative or quantitative methods, if not both (Ciula, 2017). Nonetheless, despite their significance, methods for temporal text classification and automatic document dating are still rather unexplored compared to other text classification tasks (Niculae et al., 2014). This, however, is most likely bound to change, as the increasing availability of large-scale, time-annotated digital resources, such as Google n-grams (http://books.google.com/ngrams), is promoting research in this direction. Two recent examples of this new trend, in line with the present task, are the Diachronic Text Evaluation shared task organised by Popescu et al. (2015) at SemEval 2015 and the RetroC Challenge presented by Graliński et al. (2017).

In this work we propose a simple, yet effective, approach for automatic document dating based on a linear multi-class Support Vector Machine classifier, trained on a combination of character and word n-grams, as well as document length in word tokens.

The solution is evaluated in the context of the DaDoEval – Dating Document Evaluation – shared task at EVALITA 2020 (Menini et al., 2020; Basile et al., 2020). The task is based on the Alcide De Gasperi corpus of public documents (Tonelli et al., 2019) and is organised into six sub-tasks: (I) coarse-grained classification on same-genre data, (II) coarse-grained classification on cross-genre data, (III) fine-grained classification on same-genre data, (IV) fine-grained classification on cross-genre data, (V) year-based classification on same-genre data, (VI) year-based classification on cross-genre data.

The proposed solution tackles the first two sub-tasks, coarse-grained classification on same-genre and cross-genre data. Both sub-tasks require correctly assigning document samples to one of the five main time periods identified in De Gasperi's political life, spanning a range of over fifty years, from 1901 to 1954.

The paper is structured as follows: in section 2 we provide a brief overview of the training data set, in section 3 we go over the system setup and describe the feature space, section 4 is dedicated to results analysis and discussion, in section 5 we consider possible improvements, while section 6 is reserved for final remarks.
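Both coarse-grained sub-tasks are ranked in terms of macro-average F1, which gives each of the five periods equal weight regardless of how many samples it contains. As a quick illustration (a minimal sketch using scikit-learn; the toy labels are ours, not task data):

```python
from sklearn.metrics import f1_score

# Toy gold and predicted period labels for four documents.
y_true = ["1901-1918", "1901-1918", "1919-1926", "1919-1926"]
y_pred = ["1901-1918", "1919-1926", "1919-1926", "1919-1926"]

# Macro-averaging computes F1 per class and takes the unweighted mean,
# so a rare class counts as much as a frequent one.
score = f1_score(y_true, y_pred, average="macro")
print(round(score, 4))  # (2/3 + 4/5) / 2 = 0.7333
```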
              1901-1918  1919-1926  1927-1942  1943-1947  1948-1954
Samples per class   572        342        150        514        632
Avg. sample length  867       1033       3044        633       1209

Table 1: Training set overview, showing the number of document samples per class and the average number of word tokens per sample, rounded to the nearest integer.

2 Data

The training data set released for the shared task includes 2,210 document samples extracted from the Alcide De Gasperi corpus of public documents, a multi-genre collection of 2,759 texts written or transcribed between 1901 and 1954 (Tonelli et al., 2019).

With respect to the coarse-grained classification sub-tasks, the given samples are organised into five classes (see Table 1) corresponding to the main time periods historians identified in De Gasperi's political life: Habsburg years (1901-1918), Beginning of political activity (1919-1926), Internal exile (1927-1942), From fascism to the Italian Republic (1943-1947), Building the Italian Republic (1948-1954).

A preliminary analysis of the data set reveals an imbalanced class distribution, with a significantly lower number of samples in the third class, corresponding to the 1927-1942 interval. This, however, is partially mitigated by the markedly higher average number of word tokens per sample observed in this class compared to the other ones.

3 System Description

The proposed solution is based on a Support Vector Machine (SVM) classifier implemented using the Scikit-learn library (Pedregosa et al., 2011). To account for the rather imbalanced data set, the SVM is tuned in such a way that classes are assigned weights inversely proportional to their frequency in the input data. Following the assumption that most text categorisation problems are linearly separable (Joachims, 1998), the model uses a linear kernel implemented in terms of libsvm (Chang and Lin, 2011), while relying on a one-versus-one decision strategy to handle both sub-tasks as multi-class, single-label classification problems.

3.1 Feature space

The system relies solely on the data provided by the task organisers, which is split into a training set (80%) and a development set (20%). No preprocessing is applied, as measures such as case normalisation and punctuation removal do not seem to improve the classification result on the development set, but rather to worsen it.

Each document in the data set is represented using three sets of features: document length in terms of word tokens, as well as character and word n-grams. In this respect, we explore the idea that SVMs trained on combinations of character and word n-grams are particularly effective in tackling text classification tasks (Çöltekin and Rama, 2017; Çöltekin and Rama, 2018).

Character n-grams are extracted for n ∈ {3, 4, 5} and span across word boundaries, thus capturing punctuation and space characters occurring at the beginning and at the end of each word token. Word n-grams, on the other hand, are extracted for n ∈ {1, 2}. Both feature sets are weighted using term frequency-inverse document frequency (TF-IDF) to scale down the impact of the most frequent n-grams.

The number of word tokens per document is computed in a naive way, splitting each sample at every white space. Similarly to the n-gram features, token counts are scaled to a 0-1 range in an attempt to avoid numerical problems and prevent features in higher numeric ranges from dominating those in smaller ones (Hsu et al., 2003).

3.2 Optimisation and Tuning

The system hyper-parameters are optimised to obtain the best F1 score on the development set. A subset of the hyper-parameters is tuned empirically, through several experiments, or on the basis of existing literature. This is the case for kernel type, decision strategy, class balancing, tolerance for the stopping criterion (tol) and n-gram sizes.

The remaining hyper-parameters considered during optimisation are the regularisation parameter (C) together with the maximum and minimum document frequency (max_df, min_df), which in the present approach are used to set an acceptance threshold for high- and low-frequency n-grams. These hyper-parameters are tuned through the BayesSearchCV algorithm implemented in the scikit-optimize library (Head et al., 2020), using 5-fold shuffled cross-validation. BayesSearchCV relies on Bayesian Optimisation and explores the hyper-parameter search space exploiting the information available from previous evaluations. This is in contrast to other approaches, such as grid and random search, which move across the search space either in an exhaustive or completely random manner.

Table 2 summarises the best hyper-parameter setup obtained from the tuning process.

Component        Parameter          Value
TfidfVectorizer  analyzer           word
                 max_df             0.9
                 min_df             0.004
                 ngram_range        (1, 2)
                 lowercase          False
TfidfVectorizer  analyzer           char
                 max_df             0.3
                 min_df             0.001
                 ngram_range        (3, 5)
                 lowercase          False
SVM              kernel             linear
                 decision_function  ovo
                 tol                1e-12
                 C                  0.881
                 class_weight       balanced

Table 2: Final hyper-parameter setup for each system component.

Sub-task     Team        Run  Macro F1
same-genre   matteo-brv  1    0.934
             matteo-brv  2    0.934
             team 1      1    0.858
             team 1      2    0.855
             baseline    -    0.827
cross-genre  matteo-brv  1    0.413
             matteo-brv  2    0.413
             team 1      1    0.392
             baseline    -    0.368
             team 1      2    0.366

Table 3: Final rankings for sub-tasks 1 and 2 in terms of macro-average F1 scores.
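The feature space and the hyper-parameters in Table 2 map naturally onto a scikit-learn pipeline. The following is a minimal sketch of such a setup, assuming toy documents and labels; the actual training script lives in the linked repository and may differ in detail (in particular, the choice of MinMaxScaler for the 0-1 length scaling is our assumption):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import SVC

def token_counts(docs):
    # Naive document length: split each sample at every white space.
    return np.array([[len(d.split())] for d in docs], dtype=float)

features = FeatureUnion([
    ("word_tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                   max_df=0.9, min_df=0.004, lowercase=False)),
    ("char_tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                   max_df=0.3, min_df=0.001, lowercase=False)),
    ("doc_length", Pipeline([
        ("count", FunctionTransformer(token_counts)),
        ("scale", MinMaxScaler()),  # token counts scaled to a 0-1 range
    ])),
])

model = Pipeline([
    ("features", features),
    ("svm", SVC(kernel="linear", C=0.881, tol=1e-12,
                decision_function_shape="ovo", class_weight="balanced")),
])

# Toy corpus standing in for the De Gasperi training samples.
docs = [
    "Trento e la questione universitaria",
    "Il discorso alla camera dei deputati",
    "La ricostruzione economica del paese",
    "Lettera agli elettori della provincia",
    "Il congresso del partito popolare",
    "Le trattative diplomatiche a Parigi",
    "La campagna elettorale del millenovecento",
    "Un appello ai cittadini della repubblica",
]
labels = ["1901-1918", "1901-1918", "1943-1947", "1901-1918",
          "1919-1926", "1943-1947", "1919-1926", "1943-1947"]

model.fit(docs, labels)
preds = model.predict(docs)
```

Note that with float values, max_df and min_df act as document-frequency proportions, pruning n-grams that occur in too many or too few documents, which is how they serve as the acceptance thresholds described above.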
4 Results

In this section we present the results for the two sub-tasks the system participated in. Results are summarised in Table 3 and reported in terms of macro-average F1 score. The system ranked first both in the same-genre and in the cross-genre coarse-grained classification task, obtaining macro-average F1 scores of 0.934 and 0.413, respectively.

4.1 Classification on same-genre data

The runs submitted for the first sub-task are based on test samples of the same genre as the ones in the training set. The system scored well above the baseline, which was computed with a Logistic Regression model trained on TF-IDF-weighted word unigrams, without performing any preprocessing.

Overall, the results registered on the test set are in line with those observed during training. This is confirmed by the data summarised in Table 4 and by the confusion matrix in Figure 1. The confusion matrix depicts a run on the development set which achieved a macro-average F1 score of 0.95, while Table 4 reports the per-class results of the best test run submitted for the sub-task. In both cases, 1919-1926, 1943-1947 and 1948-1954 are the classes showing the highest number of misclassifications and, incidentally, are also the ones corresponding to the shortest time periods.

Class      Precision  Recall  F1
1901-1918      0.914   0.986  0.948
1919-1926      0.96    0.872  0.913
1927-1942      0.973   0.973  0.973
1943-1947      0.898   0.898  0.898
1948-1954      0.939   0.933  0.936

Table 4: Per-class results of the best test run for sub-task 1.

            1901-1918  1919-1926  1927-1942  1943-1947  1948-1954
1901-1918         114          0          0          0          0
1919-1926           6         59          0          2          0
1927-1942           1          0         31          0          0
1943-1947           1          0          0         87          9
1948-1954           0          0          0          5        127

Figure 1: Confusion matrix for a development set run with a macro-average F1 score of 0.95 (rows: true label; columns: predicted label).

4.2 Classification on cross-genre data

The runs submitted for the second sub-task are based on samples coming from a cross-genre, out-of-domain test data set. These samples are a subset of the documents collected for the Epistolario project (Tonelli et al., 2020), an ongoing effort to create a digital archive of Alcide De Gasperi's private and public correspondence.

As expected, despite scoring above the baseline, cross-genre results are significantly lower than those obtained in the same-genre task. Per-class results summarised in Table 5 show how the promising system performances registered in the same-genre task do not transfer to the cross-genre one, suggesting a poor ability of the model to generalise. Particularly interesting and worth investigating are the results registered for the third class, corresponding to the 1927-1942 interval. With respect to this class, precision and recall values are equal to 0, indicating that the model did not recognise any sample as belonging to this time period.

Class      Precision  Recall  F1
1901-1918      0.583   0.7    0.636
1919-1926      1.0     0.15   0.261
1927-1942      0.0     0.0    0.0
1943-1947      0.6     0.75   0.667
1948-1954      0.354   0.85   0.5

Table 5: Per-class results of the best test run for sub-task 2.

5 Possible improvements

Results for the same-genre task are quite encouraging and in line with those obtained on the development set, where the F1 score ranges between 0.92 and 0.96. However, with the current data and setup, there might not be much room for further improvement. Nonetheless, additional features like richness measures and linguistically motivated features (e.g. POS tags) are explored in other contributions (Štajner and Zampieri, 2013; Zampieri et al., 2016) and could help achieve more stable results.

On the other hand, results for the second sub-task suggest a lack of generalisation on cross-genre, out-of-domain data. In this respect, even though SVM-based systems for text classification should be able to perform well and take advantage of high-dimensional feature spaces (Joachims, 1998), it might still be worthwhile experimenting with some feature selection methods. Another angle worth considering is that the system might be too sensitive to the shallow n-gram features used to represent the training data. In this case, including deeper text features, such as those encoding syntactic information, might help the system to abstract away from the lexical level. A first step in this direction is attempted by Szymanski and Lynch (2015), who employ Google Syntactic N-grams in an SVM-based system that participated in the Diachronic Text Evaluation shared task (Popescu et al., 2015) at SemEval 2015.
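A feature selection step of the kind alluded to above can be slotted into a scikit-learn pipeline between vectorisation and classification. The sketch below uses univariate chi-squared selection purely as an illustration; the selector, the k value and the toy data are our assumptions, not part of the submitted system:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus; in practice this would be the De Gasperi training samples.
docs = [
    "discorso alla camera dei deputati",
    "lettera agli elettori trentini",
    "appello ai cittadini della repubblica",
    "trattative diplomatiche presso Parigi",
]
labels = ["1919-1926", "1901-1918", "1948-1954", "1943-1947"]

model = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    # Keep only the k n-grams most associated with the class labels;
    # chi2 is applicable here because TF-IDF values are non-negative.
    ("select", SelectKBest(chi2, k=10)),
    ("svm", SVC(kernel="linear", class_weight="balanced")),
])
model.fit(docs, labels)
```

Whether such pruning actually helps cross-genre generalisation would of course have to be verified on the development set.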
6 Conclusions

In this paper we describe a simple, yet effective, approach for automatic document dating implemented for the DaDoEval shared task at EVALITA 2020. The system is based on a linear Support Vector Machine and is trained on a small set of stylistic and lexical features, resulting in a fast and efficient classification model.

In particular, the approach achieves top scores in both coarse-grained classification sub-tasks, thus confirming that SVM-based systems trained on character and word n-grams are indeed well suited to tackle text classification problems. Nonetheless, results observed in the second task suggest that the model does not generalise well on cross-genre data, leaving room for further improvements.

Acknowledgments

We thank Dr. Çağrı Çöltekin for his patient encouragement and valuable suggestions throughout this project.

References

Omar Alonso, Michael Gertz and Ricardo Baeza-Yates. 2007. On the value of temporal information in information retrieval. SIGIR Forum, 41:35–41.

Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates and Michael Gertz. 2011. Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop, 11:1–8.

Valerio Basile, Danilo Croce, Maria Di Maro and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.

Arianna Ciula. 2017. Digital palaeography: What is digital about it? Digital Scholarship in the Humanities, 32(2):ii89–ii105.

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 146–155.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proceedings of the 12th International Workshop on Semantic Evaluation, 34–38.

Filip Graliński, Rafał Jaworski, Łukasz Borchmann and Piotr Wierzchoń. 2017. The RetroC Challenge: How to Guess the Publication Year of a Text? In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 29–34.

Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe and Iaroslav Shcherbatyi. 2020. scikit-optimize/scikit-optimize (Version v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.4014775.

Chih-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), 1398:137–142.

Stefano Menini, Giovanni Moretti, Rachele Sprugnoli and Sara Tonelli. 2020. DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Vlad Niculae, Marcos Zampieri, Liviu Dinu and Alina M. Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2:17–21.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Octavian Popescu and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 870–878.

Sanja Štajner and Marcos Zampieri. 2013. Stylistic Changes for Temporal Text Classification. In Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD), Lecture Notes in Artificial Intelligence (LNAI 8082), Springer, 519–526.

Terrence Szymanski and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 879–883.

Sara Tonelli, Rachele Sprugnoli and Giovanni Moretti. 2019. Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain. In Proceedings of CLiC-it 2019.

Sara Tonelli, Rachele Sprugnoli, Giovanni Moretti, Stefano Malfatti and Marco Odorizzi. 2020. Epistolario De Gasperi: National Edition of De Gasperi's Letters in Digital Format. In Proceedings of AIUCD.

Marcos Zampieri, Shervin Malmasi and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), 4098–4104.