matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating

Matteo Brivio, Department of Linguistics, University of Tübingen, matteo.brivio@student.uni-tuebingen.de

English. This paper describes our contribution to the EVALITA 2020 shared task DaDoEval (Dating Document Evaluation). The solution we present is based on a linear multi-class Support Vector Machine classifier trained on a combination of character and word n-grams, as well as the number of word tokens per document. Despite its simplicity, the system ranked first both in the coarse-grained classification task on same-genre data and in the one on cross-genre data, achieving macro-average F1 scores of 0.934 and 0.413, respectively. The system implementation is available at https://github.com/matteobrv/DaDoEval.

Introduction

Temporal information, such as the publication date of a document, is of major relevance in a number of domains, like historical linguistics and digital humanities (Niculae et al., 2014). This is arguably even more true for a wide range of information retrieval tasks, such as document exploration, similarity search, summarisation and clustering, where the temporal dimension plays a major role in improving search results (Alonso et al., 2007; Alonso et al., 2011).

Such information, however, is not always readily available and must therefore be inferred, relying on qualitative methods, quantitative methods, or both (Ciula, 2017). Despite their significance, methods for temporal text classification and automatic document dating are still rather unexplored compared to other text classification tasks (Niculae et al., 2014). This is likely to change, as the increasing availability of large-scale, time-annotated digital resources, such as Google n-grams¹, is promoting research in this direction. Two recent examples of this trend, in line with the present task, are the Diachronic Text Evaluation shared task organised by Popescu et al. (2015) at SemEval 2015 and the RetroC Challenge presented by Graliński et al. (2017).

In this work we propose a simple, yet effective, approach for automatic document dating based on a linear multi-class Support Vector Machine classifier, trained on a combination of character and word n-grams, as well as document length in word tokens.

The solution is evaluated in the context of the DaDoEval (Dating Document Evaluation) shared task at EVALITA 2020 (Menini et al., 2020; Basile et al., 2020). The task is based on Alcide De Gasperi's corpus of public documents (Tonelli et al., 2019) and is organised into six sub-tasks: (I) coarse-grained classification on same-genre data, (II) coarse-grained classification on cross-genre data, (III) fine-grained classification on same-genre data, (IV) fine-grained classification on cross-genre data, (V) year-based classification on same-genre data, (VI) year-based classification on cross-genre data.

The proposed solution tackles the first two sub-tasks, coarse-grained classification on same-genre and cross-genre data. Both sub-tasks require assigning each document sample to one of the five main time periods identified in De Gasperi's political life, which together span over fifty years, from 1901 to 1954.

The paper is structured as follows: in section 2 we provide a brief overview of the training data set, in section 3 we go over the system setup and describe the feature space, section 4 is dedicated to results analysis and discussion, in section 5 we consider possible improvements, while section 6 is reserved for final remarks.

Data

The training data set released for the shared task includes 2,210 document samples extracted from Alcide De Gasperi's corpus of public documents, a multi-genre collection of 2,759 texts written or transcribed between 1901 and 1954 (Tonelli et al., 2019).

With respect to the coarse-grained classification sub-tasks, the given samples are organised into five classes (see Table 1). A preliminary analysis of the data set reveals an imbalanced class distribution, with a considerably lower number of samples in the third class, corresponding to the 1927-1942 interval. This, however, is partially mitigated by the markedly higher average number of word tokens per sample observed in this class compared to the others.

System Description

The proposed solution is based on a Support Vector Machine (SVM) classifier implemented using the Scikit-learn library (Pedregosa et al., 2011).

To account for the rather imbalanced data set, the SVM is tuned in such a way that classes are assigned weights inversely proportional to their frequency in the input data.
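As a rough illustration of inverse-frequency weighting, the sketch below applies the formula that Scikit-learn documents for `class_weight="balanced"`, namely n_samples / (n_classes * class_count), to the class sizes listed in Table 1. The helper name `balanced_class_weights` is hypothetical and is not taken from the system's code.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Number of training samples per class, as reported in Table 1.
class_sizes = {
    "1901-1918": 572,
    "1919-1926": 342,
    "1927-1942": 150,
    "1943-1947": 514,
    "1948-1954": 632,
}

# Expand the counts into one label per sample, then compute the weights.
labels = [cls for cls, n in class_sizes.items() for _ in range(n)]
weights = balanced_class_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Under this scheme the smallest class (1927-1942) receives the largest weight, so errors on its samples are penalised more heavily during training.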

Following the assumption that most text categorisation problems are linearly separable (Joachims, 1998), the model uses a linear kernel implemented in terms of libsvm (Chang and Lin, 2011), while relying on a one-versus-one decision strategy to handle both sub-tasks as multi-class, single-label classification problems.
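A minimal sketch of such a classifier, assuming Scikit-learn's libsvm-backed `SVC` class and the values listed in Table 2, could look as follows. The toy two-dimensional data is purely illustrative and stands in for the actual n-gram features.

```python
from sklearn.svm import SVC

# Linear-kernel SVM with balanced class weights; C and tol follow Table 2.
clf = SVC(
    kernel="linear",
    decision_function_shape="ovo",  # expose one-versus-one decision values
    class_weight="balanced",
    C=0.881,
    tol=1e-12,
)

# Toy data (an assumption for illustration): three classes, two samples each.
X = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [2.0, 0.0], [2.1, 0.1]]
y = ["a", "a", "b", "b", "c", "c"]
clf.fit(X, y)

# With k classes, one-versus-one yields k * (k - 1) / 2 pairwise decision values.
decision_values = clf.decision_function(X)
print(decision_values.shape)
```

Note that `SVC` always trains one-versus-one binary classifiers for multi-class problems; `decision_function_shape` only controls the shape of the returned decision values.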

Feature space

The system relies solely on the data provided by the task organisers, which is split into a training set (80%) and a development set (20%). No preprocessing is applied, since measures such as case normalisation and punctuation removal do not improve the classification results on the development set, but rather worsen them.
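An 80/20 split of this kind can be sketched with Scikit-learn's `train_test_split`. The stratification by class and the fixed random seed below are assumptions made for the example, since the paper does not state how the split was drawn, and the placeholder documents are invented.

```python
from sklearn.model_selection import train_test_split

# Placeholder documents and labels (illustrative only).
texts = [f"document {i}" for i in range(20)]
labels = ["1901-1918"] * 10 + ["1948-1954"] * 10

# Hold out 20% of the samples as a development set.
# stratify and random_state are assumptions, not settings from the paper.
train_texts, dev_texts, train_labels, dev_labels = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

print(len(train_texts), len(dev_texts))
```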

Each document in the data set is represented using three sets of features: document length in terms of word tokens as well as character and word n-grams. In this respect, we explore the idea that SVMs trained on combinations of character and word n-grams are particularly effective in tackling text classification tasks (Çöltekin and Rama, 2017; Çöltekin and Rama, 2018).

Character n-grams are extracted for n ∈ {3, 4, 5} and span across word boundaries, thus capturing punctuation and space characters occurring at the beginning and at the end of each word token. Word n-grams, on the other hand, are extracted for n ∈ {1, 2}. Both feature sets are weighted using term-frequency, inverse-document frequency (TF-IDF) to scale down the impact of the most frequent n-grams.
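The two n-gram feature sets can be sketched with Scikit-learn's `TfidfVectorizer`. Combining them through `FeatureUnion`, relying on the default word tokenisation pattern, and the two toy sentences are assumptions made for illustration; the `max_df` and `min_df` thresholds from Table 2 are left out because they would filter nearly everything on such a tiny example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# analyzer="char" builds n-grams over the raw character stream, so they can
# span word boundaries (unlike "char_wb", which stays inside each word).
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), lowercase=False)
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), lowercase=False)

# Concatenate both TF-IDF blocks into a single sparse feature matrix.
ngram_features = FeatureUnion([("char", char_tfidf), ("word", word_tfidf)])

docs = ["Il Trentino e l'Austria.", "La Repubblica italiana, oggi."]
X = ngram_features.fit_transform(docs)

fitted = dict(ngram_features.transformer_list)
char_vocab = fitted["char"].vocabulary_
word_vocab = fitted["word"].vocabulary_

print(X.shape[0], len(char_vocab), len(word_vocab))
print("o e" in char_vocab)  # a trigram that crosses a word boundary
```

Setting `lowercase=False` mirrors the choice of not applying case normalisation.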

The number of word tokens per document is computed in a naive way, splitting each sample at every white space. Similarly to the n-gram features, token counts are scaled down to a 0-1 range in an attempt to avoid numerical problems and to prevent features in larger numeric ranges from dominating those in smaller ones (Hsu et al., 2003).
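The length feature can be sketched in a few lines of plain Python. Min-max scaling is assumed here as the way of mapping counts to the 0-1 range, since the exact scaling method is not specified; the function names and toy documents are hypothetical.

```python
def token_count(text):
    """Count tokens by splitting on runs of white space."""
    return len(text.split())


def min_max_scale(values):
    """Map values linearly onto [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


docs = ["uno due tre", "uno", "uno due tre quattro cinque"]
counts = [token_count(d) for d in docs]
scaled = min_max_scale(counts)

print(counts)  # [3, 1, 5]
print(scaled)  # [0.5, 0.0, 1.0]
```

In a real setting the minimum and maximum would be estimated on the training set only and then reused for the development and test data.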

Optimisation and Tuning

The system hyper-parameters are optimised to obtain the best F1 score on the development set.

A subset of the hyper-parameters is tuned empirically through several experiments or on the basis of existing literature. This is the case for kernel type, decision strategy, class balancing, tolerance for the stopping criterion (tol) and n-gram size.

The remaining hyper-parameters considered during optimisation are the regularisation parameter (C) together with the maximum and minimum document frequency (max_df, min_df), which in the present approach are used to set acceptance thresholds for high- and low-frequency n-grams.

These hyper-parameters are tuned through the BayesSearchCV algorithm implemented in the scikit-optimize library (Head et al., 2020), using shuffled 5-fold cross-validation. BayesSearchCV relies on Bayesian optimisation and explores the hyper-parameter search space by exploiting the information available from previous evaluations. This is in contrast to other approaches, such as grid search and random search, which move across the search space in an exhaustive or a completely random manner, respectively.

Table 2 summarises the best hyper-parameters setup obtained from the tuning process.

Results

In this section we present the results for the two sub-tasks the system participated in. Results are summarised in Table 3 and reported in terms of macro-average F1 score.

The system ranked first both in the same-genre and in the cross-genre coarse-grained classification task, obtaining a macro-average F1 score of 0.934 and 0.413, respectively.
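As a quick sanity check of how these figures arise, the macro-average F1 is the unweighted mean of the per-class F1 scores, each of which is the harmonic mean of precision and recall. The sketch below recomputes both reported scores from the per-class values in Tables 4 and 5; the function names are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Per-class F1 scores as reported in Table 4 (same-genre) and Table 5 (cross-genre).
same_genre_f1 = [0.948, 0.913, 0.973, 0.898, 0.936]
cross_genre_f1 = [0.636, 0.261, 0.0, 0.667, 0.5]

print(round(macro_f1(same_genre_f1), 3))   # 0.934
print(round(macro_f1(cross_genre_f1), 3))  # 0.413

# Example: class 1919-1926 in Table 5 (precision 1.0, recall 0.15).
print(round(f1(1.0, 0.15), 3))  # 0.261
```

Because every class counts equally, the class with an F1 of 0.0 in the cross-genre setting pulls the macro average down considerably.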

[Table 3. Columns: SUB-TASK, TEAM, RUN, MACRO F1]

Classification on same-genre data

The runs submitted for the first sub-task are based on test samples of the same genre as the ones in the training set. The system scored well above the baseline, which was computed with a Logistic Regression model trained on TF-IDF-weighted word unigrams, without performing any preprocessing.
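For reference, a baseline of this general shape can be sketched with a Scikit-learn pipeline. This is not the organisers' code: the default tokenisation, `lowercase=False` as a reading of "no preprocessing", the `max_iter` value and the toy documents are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# TF-IDF-weighted word unigrams fed to a Logistic Regression classifier.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 1), lowercase=False),
    LogisticRegression(max_iter=1000),
)

# Toy training data (illustrative only).
docs = [
    "impero asburgico parlamento Vienna",
    "dieta tirolese Vienna impero",
    "repubblica costituzione governo Roma",
    "governo Roma repubblica ricostruzione",
]
labels = ["1901-1918", "1901-1918", "1948-1954", "1948-1954"]

baseline.fit(docs, labels)
predictions = baseline.predict(["il governo della repubblica"])
print(predictions)
```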

Overall, the results registered on the test set are in line with those observed during training. This is confirmed by the data summarised in Table 4 and by the confusion matrix in Figure 1.

The confusion matrix depicts a run on the development set which achieved a macro-average F1 score of 0.95, while Table 4 reports the per-class results of the best test run submitted for the sub-task. In both cases 1919-1926, 1943-1947 and 1948-1954 are the classes showing the highest number of misclassifications and, incidentally, are also the ones corresponding to the shortest time periods.

Classification on cross-genre data

The runs submitted for the second sub-task are based on samples coming from a cross-genre, out-of-domain test data set. These samples are a subset of the documents collected for the Epistolario project (Tonelli et al., 2020), an ongoing effort to create a digital archive of Alcide De Gasperi's private and public correspondence.

As expected, despite scoring above the baseline, cross-genre results are considerably lower than those obtained in the same-genre task. The per-class results summarised in Table 5 show that the promising performance registered in the same-genre task does not transfer to the cross-genre one, suggesting a poor ability of the model to generalise. Particularly interesting and worth investigating are the results registered for the third class, corresponding to the 1927-1942 interval. For this class, precision and recall are both equal to 0, indicating that the model did not correctly assign any sample to this time period.

Possible improvements

Results for the same-genre task are quite encouraging and in line with those obtained on the development set, where the F1 score ranges between 0.92 and 0.96. However, with the current data and setup, there might not be much room for further improvement. Nonetheless, additional features like richness measures and linguistically motivated features (e.g. POS tags) are explored in other contributions (Štajner and Zampieri, 2013; Zampieri et al., 2016) and could help achieve more stable results.

On the other hand, results for the second sub-task suggest a lack of generalisation on cross-genre, out-of-domain data. In this respect, even though SVM-based systems for text classification should be able to perform well and take advantage of high-dimensional feature spaces (Joachims, 1998), it might still be worthwhile experimenting with some feature selection methods. Another angle worth considering is that the system might be too sensitive to the shallow n-gram features used to represent the training data. In this case, including deeper text features, such as those encoding syntactic information, might help the system to abstract away from the lexical level. A first step in this direction is attempted by Szymanski and Lynch (2015), who employ Google Syntactic N-grams in an SVM-based system that participated in the Diachronic Text Evaluation shared task (Popescu et al., 2015) at SemEval 2015.

Conclusions

In this paper we describe a simple, yet effective, approach for automatic document dating implemented for the DaDoEval shared task at EVALITA 2020. The system is based on a linear Support Vector Machine and is trained on a small set of stylistic and lexical features, resulting in a fast and efficient classification model.

In particular, the approach achieves top scores in both coarse-grained classification sub-tasks, thus confirming that SVM-based systems trained on character and word n-grams are indeed well suited to tackle text classification problems.

Nonetheless, results observed in the second task suggest that the model does not generalise well on cross-genre data, leaving room for further improvements.

The five classes correspond to the main time periods historians identified in De Gasperi's political life: Habsburg years (1901-1918), Beginning of political activity (1919-1926), Internal exile (1927-1942), From fascism to the Italian Republic (1943-1947), Building the Italian Republic (1948-1954).

Figure 1: Confusion matrix for a development set run with a macro-average F1 score of 0.95.

Table 1: Training set overview, showing the number of document samples per class and the average number of word tokens per sample, rounded up to the nearest integer.

| | 1901-1918 | 1919-1926 | 1927-1942 | 1943-1947 | 1948-1954 |
|---|---|---|---|---|---|
| SAMPLES PER CLASS | 572 | 342 | 150 | 514 | 632 |
| AVG. SAMPLE LENGTH | 867 | 1033 | 3044 | 633 | 1209 |

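Table 2: Final hyper-parameters setup for each system component.

| COMPONENT | PARAMETER | VALUE |
|---|---|---|
| TfidfVectorizer | analyzer | word |
| | max_df | 0.9 |
| | min_df | 0.004 |
| | ngram_range | (1, 2) |
| | lowercase | False |
| TfidfVectorizer | analyzer | char |
| | max_df | 0.3 |
| | min_df | 0.001 |
| | ngram_range | (3, 5) |
| | lowercase | False |
| SVM | kernel | linear |
| | decision function | ovo |
| | tol | 1e-12 |
| | C | 0.881 |
| | class_weight | balanced |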

Table 4: Per-class results of the best test run for sub-task 1.

| CLASS | PRECISION | RECALL | F1 |
|---|---|---|---|
| 1901-1918 | 0.914 | 0.986 | 0.948 |
| 1919-1926 | 0.96 | 0.872 | 0.913 |
| 1927-1942 | 0.973 | 0.973 | 0.973 |
| 1943-1947 | 0.898 | 0.898 | 0.898 |
| 1948-1954 | 0.939 | 0.933 | 0.936 |

Table 5: Per-class results of the best test run for sub-task 2.

| CLASS | PRECISION | RECALL | F1 |
|---|---|---|---|
| 1901-1918 | 0.583 | 0.7 | 0.636 |
| 1919-1926 | 1.0 | 0.15 | 0.261 |
| 1927-1942 | 0.0 | 0.0 | 0.0 |
| 1943-1947 | 0.6 | 0.75 | 0.667 |
| 1948-1954 | 0.354 | 0.85 | 0.5 |

¹ http://books.google.com/ngrams

Acknowledgments

We thank Dr. Çağrı Çöltekin for his patient encouragement and valuable suggestions throughout this project.

References

Omar Alonso, Michael Gertz, and Ricardo Baeza-Yates. 2007. On the value of temporal information in information retrieval. SIGIR Forum, 41.

Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates, and Michael Gertz. 2011. Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop, 11.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27.

Arianna Ciula. 2017. Digital palaeography: What is digital about it? Digital Scholarship in the Humanities, 32(2).

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial).

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation.

Filip Graliński, Rafał Jaworski, Łukasz Borchmann, and Piotr Wierzchoń. 2017. The RetroC Challenge: How to Guess the Publication Year of a Text? In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage.

Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe, and Iaroslav Shcherbatyi. 2020. scikit-optimize/scikit-optimize. Version v0. doi:10.5281/zenodo.4014775.

Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), volume 1398.

Stefano Menini, Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2020. DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Vlad Niculae, Marcos Zampieri, Liviu Dinu, and Alina M. Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12.

Octavian Popescu and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic text evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Sanja Štajner and Marcos Zampieri. 2013. Stylistic Changes for Temporal Text Classification. In Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD), Lecture Notes in Artificial Intelligence (LNAI), volume 8082. Springer.

Terrence Szymanski and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015).

Sara Tonelli, Rachele Sprugnoli, and Giovanni Moretti. 2019. Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain. In Proceedings of CLIC-it 2019.

Sara Tonelli, Rachele Sprugnoli, Giovanni Moretti, Stefano Malfatti, and Marco Odorizzi. 2020. Epistolario De Gasperi: National Edition of De Gasperi's Letters in Digital Format. In Proceedings of AIUCD 2020.

Marcos Zampieri, Shervin Malmasi, and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16).