
matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating

Matteo Brivio

matteo.brivio@student.uni-tuebingen.de
University of Tübingen, Department of Linguistics

English. This paper describes our contribution to the EVALITA 2020 shared task DaDoEval - Dating Document Evaluation. The solution we present is based on a linear multi-class Support Vector Machine classifier trained on a combination of character and word n-grams, as well as the number of word tokens per document. Despite its simplicity, the system ranked first both in the coarse-grained classification task on same-genre data and in the one on cross-genre data, achieving a macro-average F1 score of 0.934 and 0.413, respectively. The system implementation is available at https://github.com/matteobrv/DaDoEval.

1 Introduction

Temporal information, such as the publication date of a document, is of major relevance in a number of domains, like historical linguistics and digital humanities (Niculae et al., 2014). This is arguably even more true for a wide range of information retrieval tasks, such as document exploration, similarity search, summarisation and clustering, where the temporal dimension plays a major role in improving search results (Alonso et al., 2007; Alonso et al., 2011).

Such information, however, is not always readily available and must therefore be inferred, relying either on qualitative or quantitative methods, if not both (Ciula, 2017). Nonetheless, despite their significance, methods for temporal text classification and automatic document dating are still rather unexplored compared to other text classification tasks (Niculae et al., 2014). This, however, is most likely bound to change as the increasing availability of large-scale, time-annotated digital resources, such as Google n-grams¹, is promoting research in this direction. Two recent examples of this new trend, in line with the present task, are the Diachronic Text Evaluation shared task organised by Popescu et al. (2015) at SemEval 2015 and the RetroC Challenge presented by Graliński et al. (2017).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

In this work we propose a simple, yet effective, approach for automatic document dating based on a linear multi-class Support Vector Machine classifier, trained on a combination of character and word n-grams, as well as document length in word tokens.

The solution is evaluated in the context of the DaDoEval – Dating Document Evaluation – shared task at EVALITA 2020 (Menini et al., 2020; Basile et al., 2020). The task is based on Alcide De Gasperi’s corpus of public documents (Tonelli et al., 2019) and is organised into six sub-tasks: (I) coarse-grained classification on same-genre data, (II) coarse-grained classification on cross-genre data, (III) fine-grained classification on same-genre data, (IV) fine-grained classification on cross-genre data, (V) year-based classification on same-genre data, (VI) year-based classification on cross-genre data.

The proposed solution tackles the first two sub-tasks, coarse-grained classification on same-genre and cross-genre data. Both sub-tasks require assigning each document sample to one of the five main time periods identified in De Gasperi’s political life, spanning a range of over fifty years from 1901 to 1954.

The paper is structured as follows: in section 2 we provide a brief overview of the training data set, in section 3 we go over the system setup and describe the feature space, section 4 is dedicated to results analysis and discussion, in section 5 we consider possible improvements, while section 6 is reserved for final remarks.

¹ http://books.google.com/ngrams

2 Data

The training data set released for the shared task includes 2,210 document samples extracted from Alcide De Gasperi’s corpus of public documents, a multi-genre collection of 2,759 texts written or transcribed between 1901 and 1954 (Tonelli et al., 2019).

With respect to the coarse-grained classification sub-tasks, the given samples are organised into five classes (see Table 1) corresponding to the main time periods historians identified in De Gasperi’s political life: Habsburg years (1901-1918), Beginning of political activity (1919-1926), Internal exile (1927-1942), From fascism to the Italian Republic (1943-1947), Building the Italian Republic (1948-1954).

A preliminary analysis of the data set reveals an imbalanced class distribution, with a significantly lower number of samples in the third class, corresponding to the 1927-1942 interval. This, however, is partially mitigated by the markedly higher average number of word tokens per sample observed in this class compared to the other ones.

3 System Description

The proposed solution is based on a Support Vector Machine (SVM) classifier implemented using the Scikit-learn library (Pedregosa et al., 2011).

To account for the rather imbalanced data set, the SVM is tuned in such a way that classes are assigned weights inversely proportional to their frequency in the input data.
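As an illustration of this weighting scheme, the sketch below computes weights of the form n_samples / (n_classes * class_count), which is the heuristic Scikit-learn applies when `class_weight` is set to `balanced`. The helper name and the toy label distribution are hypothetical and do not reflect the actual DaDoEval class counts.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class, inversely proportional to its frequency.

    Uses n_samples / (n_classes * class_count), i.e. the same heuristic
    Scikit-learn applies for class_weight="balanced".
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical label distribution, NOT the real DaDoEval counts.
toy_labels = ["1901-1918"] * 6 + ["1927-1942"] * 2 + ["1948-1954"] * 4
weights = balanced_class_weights(toy_labels)
```

In this toy setting the rarest class receives the largest weight, so misclassifying one of its samples costs the classifier more than misclassifying a sample from a frequent class.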

Following the assumption that most text categorisation problems are linearly separable (Joachims, 1998), the model uses a linear kernel implemented in terms of libsvm (Chang and Lin, 2011), while relying on a one-versus-one decision strategy to handle both sub-tasks as multi-class, single-label classification problems.

3.1 Feature space

The system relies solely on the data provided by the task organisers, which is split into a training set (80%) and a development set (20%). No preprocessing is applied, as measures such as case normalisation and punctuation removal do not seem to improve the classification results on the development set, but rather worsen them.

Each document in the data set is represented using three sets of features: document length in terms of word tokens as well as character and word n-grams. In this respect, we explore the idea that SVMs trained on combinations of character and word n-grams are particularly effective in tackling text classification tasks (Çöltekin and Rama, 2017; Çöltekin and Rama, 2018).

Character n-grams are extracted for $n \in \{3, 4, 5\}$ and span across word boundaries, thus capturing punctuation and space characters occurring at the beginning and at the end of each word token. Word n-grams, on the other hand, are extracted for $n \in \{1, 2\}$. Both feature sets are weighted using term frequency-inverse document frequency (TF-IDF) to scale down the impact of the most frequent n-grams.
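To make the two n-gram types concrete, the following plain-Python sketch extracts them from a short string. It is only an illustration: the helper names and the sample sentence are hypothetical, TF-IDF weighting is left out, and Scikit-learn's own analyzers differ in detail (for instance, its default word tokeniser drops punctuation and single-character tokens).

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Slide a window over the raw string.

    Because the window ignores word boundaries, the resulting n-grams
    may contain spaces and punctuation characters.
    """
    return [
        text[i:i + n]
        for n in sizes
        for i in range(len(text) - n + 1)
    ]


def word_ngrams(text, sizes=(1, 2)):
    """Join consecutive white-space-separated tokens into word n-grams."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]


sample = "Il popolo, oggi"  # hypothetical example sentence
trigrams = char_ngrams(sample, sizes=(3,))
unigrams_and_bigrams = word_ngrams(sample)
```

Here `trigrams` contains entries such as `"l p"` and `", o"`, which straddle two word tokens and therefore encode information about punctuation and word endings that word n-grams alone would miss.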

The number of word tokens per document is computed in a naive way, splitting each sample at every white space. As with the n-gram features, token counts are scaled to a 0-1 range in an attempt to avoid numerical problems and prevent features in higher numeric ranges from dominating those in smaller ones (Hsu et al., 2003).

3.2 Optimisation and Tuning

The system hyper-parameters are optimised to obtain the best F1 score on the development set.

A subset of the hyper-parameters is tuned empirically through several experiments or on the basis of existing literature. This is the case for kernel type, decision strategy, class balancing, tolerance for stopping criterion (tol) and n-gram size.

The remaining hyper-parameters considered during optimisation are the regularisation parameter (C) together with the maximum and minimum document frequency (max_df, min_df), which in the present approach are used to set an acceptance threshold for high- and low-frequency n-grams.

Table 2: Best hyper-parameter setup obtained from the tuning process.

COMPONENT         PARAMETER           VALUE
TfidfVectorizer   analyzer            word
                  max_df              0.9
                  min_df              0.004
                  ngram_range         (1, 2)
                  lowercase           False
TfidfVectorizer   analyzer            char
                  max_df              0.3
                  min_df              0.001
                  ngram_range         (3, 5)
                  lowercase           False
SVM               kernel              linear
                  decision function   ovo
                  tol                 1e-12
                  C                   0.881
                  class_weight        balanced
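The reported setup could be assembled in Scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the author's actual implementation: the way the three feature sets are combined (`FeatureUnion`), the use of `MinMaxScaler` for the token counts, and the names `count_tokens` and `build_pipeline` are all illustrative choices; only the hyper-parameter values are taken from the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import SVC


def count_tokens(docs):
    """Naive token count: split each document at white space."""
    return np.array([[len(doc.split())] for doc in docs], dtype=float)


def build_pipeline():
    """Combine word n-grams, character n-grams and document length."""
    features = FeatureUnion([
        # Word uni- and bigrams, TF-IDF weighted, no lowercasing.
        ("word", TfidfVectorizer(analyzer="word", max_df=0.9, min_df=0.004,
                                 ngram_range=(1, 2), lowercase=False)),
        # Character 3- to 5-grams spanning word boundaries.
        ("char", TfidfVectorizer(analyzer="char", max_df=0.3, min_df=0.001,
                                 ngram_range=(3, 5), lowercase=False)),
        # Token count per document, rescaled to the 0-1 range
        # (assumed here to be done with MinMaxScaler).
        ("length", Pipeline([
            ("count", FunctionTransformer(count_tokens)),
            ("scale", MinMaxScaler()),
        ])),
    ])
    classifier = SVC(kernel="linear", decision_function_shape="ovo",
                     tol=1e-12, C=0.881, class_weight="balanced")
    return Pipeline([("features", features), ("svm", classifier)])
```

The pipeline is used like any other Scikit-learn estimator, via `build_pipeline().fit(train_docs, train_labels)` followed by `.predict(test_docs)`. Note that on very small corpora the `max_df` and `min_df` thresholds may prune most n-grams, so the values above are only meaningful for data sets of a size comparable to the one used in the paper.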

These hyper-parameters are tuned through the BayesSearchCV algorithm implemented in the scikit-optimize library (Head et al., 2020), using shuffled 5-fold cross-validation. BayesSearchCV relies on Bayesian Optimisation and explores the hyper-parameter search space by exploiting the information available from previous evaluations. This is in contrast to other approaches, such as grid and random search, which move across the search space either in an exhaustive or completely random manner.

Table 2 summarises the best hyper-parameter setup obtained from the tuning process.

4 Results

In this section we present the results for the two sub-tasks the system participated in. Results are summarised in Table 3 and reported in terms of macro-average F1 score.

The system ranked first both in the same-genre and in the cross-genre coarse-grained classification task, obtaining a macro-average F1 score of 0.934 and 0.413, respectively.
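For reference, the macro-average F1 score is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below is a plain-Python illustration of this metric, not the official task scorer; the function name and toy labels are hypothetical, and treating an undefined precision or recall as 0 is an assumption of this sketch.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores.

    Precision, recall and F1 are set to 0 when their denominator is 0
    (an assumption of this sketch).
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy labels: class "c" is never predicted, so its F1 is 0.
toy_true = ["a", "a", "b", "b", "c", "c"]
toy_pred = ["a", "a", "b", "a", "b", "b"]
toy_score = macro_f1(toy_true, toy_pred)
```

In the toy example the per-class F1 scores are 0.8, 0.4 and 0, giving a macro average of 0.4. A single class that is never recognised therefore pulls the overall score down considerably, even when the other classes are handled reasonably well.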

[Table 3: Macro-average F1 score per team and run for the same-genre and cross-genre sub-tasks.]

The runs submitted for the first sub-task are based on test samples of the same genre as the ones in the training set. The system scored well above the baseline, which was computed with a Logistic Regression model trained on TF-IDF-weighted word unigrams, without performing any preprocessing.

Overall, the results registered on the test set are in line with those observed during training. This is confirmed by the data summarised in Table 4 and by the confusion matrix in Figure 1.

The confusion matrix depicts a run on the development set which achieved a macro-average F1 score of 0.95, while Table 4 reports the per-class results of the best test run submitted for the sub-task. In both cases 1919-1926, 1943-1947 and 1948-1954 are the classes showing the highest number of misclassifications and, incidentally, are also the ones corresponding to the shortest time periods.

[Table 4: Per-class results of the best test run submitted for the same-genre sub-task.]

[Figure 1: Confusion matrix for a run on the development set (axis: predicted label).]

The runs submitted for the second sub-task are based on samples coming from a cross-genre, out-of-domain test data set. These samples are a subset of the documents collected for the Epistolario project (Tonelli et al., 2020), an ongoing effort to create a digital archive of Alcide De Gasperi’s private and public correspondence.

As expected, despite scoring above the baseline, cross-genre results are significantly lower than those obtained in the same-genre task. Per-class results summarised in Table 5 show that the promising performance registered in the same-genre task does not transfer to the cross-genre one, suggesting a poor ability of the model to generalise. Particularly interesting and worth investigating are the results registered for the third class, corresponding to the 1927-1942 interval. For this class, precision and recall values are equal to 0, indicating that the model did not recognise any sample as belonging to this time period.

5 Possible improvements

Results for the same-genre task are quite encouraging and in line with those obtained on the development set, where the F1 score ranges between 0.92 and 0.96. However, with the current data and setup, there might not be much room for further improvement. Nonetheless, additional features like richness measures and linguistically motivated features (e.g. POS tags) are explored in other contributions (Štajner and Zampieri, 2013; Zampieri et al., 2016) and could help achieve more stable results.

On the other hand, results for the second sub-task suggest a lack of generalisation on cross-genre, out-of-domain data. In this respect, even though SVM-based systems for text classification should be able to perform well and take advantage of high-dimensional feature spaces (Joachims, 1998), it might still be worthwhile experimenting with some feature selection methods. Another angle worth considering is that the system might be too sensitive to the shallow n-gram features used to represent the training data. In this case, including deeper text features, such as those encoding syntactic information, might help the system to abstract away from the lexical level. A first step in this direction is attempted by Szymanski and Lynch (2015), who employ Google Syntactic N-grams in an SVM-based system that participated in the Diachronic Text Evaluation shared task (Popescu et al., 2015) at SemEval 2015.

6 Conclusions

In this paper we describe a simple, yet effective, approach for automatic document dating implemented for the DaDoEval shared task at EVALITA 2020. The system is based on a linear Support Vector Machine and is trained on a small set of stylistic and lexical features, resulting in a fast and efficient classification model.

In particular, the approach achieves top scores in both coarse-grained classification sub-tasks, thus confirming that SVM-based systems trained on character and word n-grams are indeed well suited to tackle text classification problems.

Nonetheless, results observed in the second task suggest that the model does not generalise well on cross-genre data, leaving room for further improvements.

Acknowledgments

We thank Dr. Çağrı Çöltekin for his patient encouragement and valuable suggestions throughout this project.

References

Valerio Basile, Danilo Croce, Maria Di Maro and Lucia Passaro. 2020. Evalita 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Arianna Ciula. 2017. Digital palaeography: What is digital about it? Digital Scholarship in the Humanities, 32(2): ii89-ii105.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 task 2: SVMs perform better than RNNs in emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, 34-38.

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 146-155.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3): 27:1-27:27.

Chih-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot and Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825-2830.

Filip Graliński, Rafał Jaworski, Łukasz Borchmann and Piotr Wierzchoń. 2017. The RetroC Challenge: How to Guess the Publication Year of a Text? In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 29-34.

Marcos Zampieri, Shervin Malmasi and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), 4098-4104.

Octavian Popescu and Carlo Strapparava. 2015. Semeval 2015, task 7: Diachronic text evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 870-878.

Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates and Michael Gertz. 2011. Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop, 11: 1-8.

Omar Alonso, Michael Gertz and Ricardo Baeza-Yates. 2007. On the value of temporal information in information retrieval. SIGIR Forum, 41: 35-41.

Sanja Štajner and Marcos Zampieri. 2013. Stylistic Changes for Temporal Text Classification. In Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD), Lecture Notes in Artificial Intelligence - LNAI 8082, Springer, 519-526.

Sara Tonelli, Rachele Sprugnoli and Giovanni Moretti. 2019. Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain. In Proceedings of CLIC-it 2019.

Sara Tonelli, Rachele Sprugnoli, Giovanni Moretti, Stefano Malfatti and Marco Odorizzi. 2020. Epistolario De Gasperi: National Edition of De Gasperi's Letters in Digital Format. In Proceedings of AIUCD.

Stefano Menini, Giovanni Moretti, Rachele Sprugnoli and Sara Tonelli. 2020. DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Terrence Szymanski and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 879-883.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), 1398: 137-142.

Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe and Iaroslav Shcherbatyi. 2020. scikit-optimize/scikit-optimize (Version v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.4014775.

Vlad Niculae, Marcos Zampieri, Liviu Dinu and Alina M. Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2: 17-21.