=Paper=
{{Paper
|id=Vol-2765/96
|storemode=property
|title=matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2765/paper96.pdf
|volume=Vol-2765
|authors=Matteo Brivio
|dblpUrl=https://dblp.org/rec/conf/evalita/Brivio20
}}
==matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating (short paper)==
Matteo Brivio
University of Tübingen
Department of Linguistics
matteo.brivio@student.uni-tuebingen.de
Abstract

English. This paper describes our contribution to the EVALITA 2020 shared task DaDoEval – Dating Document Evaluation. The solution we present is based on a linear multi-class Support Vector Machine classifier trained on a combination of character and word n-grams, as well as the number of word tokens per document. Despite its simplicity, the system ranked first both in the coarse-grained classification task on same-genre data and in the one on cross-genre data, achieving a macro-average F1 score of 0.934 and 0.413, respectively. The system implementation is available at https://github.com/matteobrv/DaDoEval.

1 Introduction

Temporal information, such as the publication date of a document, is of major relevance in a number of domains, like historical linguistics and digital humanities (Niculae et al., 2014). This is arguably even more true for a wide range of information retrieval tasks, such as document exploration, similarity search, summarisation and clustering, where the temporal dimension plays a major role in improving search results (Alonso et al., 2007; Alonso et al., 2011).

Such information, however, is not always readily available and must therefore be inferred, relying either on qualitative or quantitative methods, if not both (Ciula, 2017). Nonetheless, despite their significance, methods for temporal text classification and automatic document dating are still rather unexplored compared to other text classification tasks (Niculae et al., 2014). This, however, is most likely bound to change, as the increasing availability of large-scale, time-annotated digital resources, such as Google n-grams (http://books.google.com/ngrams), is promoting research in this direction. Two recent examples of this new trend, in line with the present task, are the Diachronic Text Evaluation shared task organised by Popescu et al. (2015) at SemEval 2015 and the RetroC Challenge presented by Graliński et al. (2017).

In this work we propose a simple, yet effective, approach to automatic document dating based on a linear multi-class Support Vector Machine classifier, trained on a combination of character and word n-grams, as well as document length in word tokens.

The solution is evaluated in the context of the DaDoEval – Dating Document Evaluation – shared task at EVALITA 2020 (Menini et al., 2020; Basile et al., 2020). The task is based on the Alcide De Gasperi corpus of public documents (Tonelli et al., 2019) and is organised into six sub-tasks: (I) coarse-grained classification on same-genre data, (II) coarse-grained classification on cross-genre data, (III) fine-grained classification on same-genre data, (IV) fine-grained classification on cross-genre data, (V) year-based classification on same-genre data, and (VI) year-based classification on cross-genre data.

The proposed solution tackles the first two sub-tasks, coarse-grained classification on same-genre and cross-genre data. Both sub-tasks require correctly assigning document samples to one of the five main time periods identified in De Gasperi's political life, spanning a range of over fifty years, from 1901 to 1954.

The paper is structured as follows: in section 2 we provide a brief overview of the training data set, in section 3 we go over the system setup and describe the feature space, section 4 is dedicated to results analysis and discussion, in section 5 we consider possible improvements, while section 6 is reserved for final remarks.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                     1901-1918  1919-1926  1927-1942  1943-1947  1948-1954
Samples per class    572        342        150        514        632
Avg. sample length   867        1033       3044       633        1209

Table 1: Training set overview, showing the number of document samples per class and the average number of word tokens per sample, rounded up to the nearest integer.
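As an aside, the inverse-frequency class weighting used by the system (described in section 3 and set via `class_weight: balanced` in Table 2) can be derived directly from the counts in Table 1 with scikit-learn's standard heuristic, which assigns each class the weight n_samples / (n_classes * class_count). A minimal sketch:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Per-class sample counts from Table 1 (1901-1918 through 1948-1954).
counts = np.array([572, 342, 150, 514, 632])
y = np.repeat(np.arange(5), counts)  # one label per training sample

# 'balanced' assigns each class the weight n_samples / (n_classes * count).
weights = compute_class_weight("balanced", classes=np.arange(5), y=y)
# The under-represented 1927-1942 class (150 samples) gets the largest weight.
```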
2 Data

The training data set released for the shared task includes 2,210 document samples extracted from the Alcide De Gasperi corpus of public documents, a multi-genre collection of 2,759 texts written or transcribed between 1901 and 1954 (Tonelli et al., 2019).

With respect to the coarse-grained classification sub-tasks, the given samples are organised into five classes (see Table 1), corresponding to the main time periods historians identified in De Gasperi's political life: Habsburg years (1901-1918), Beginning of political activity (1919-1926), Internal exile (1927-1942), From fascism to the Italian Republic (1943-1947), and Building the Italian Republic (1948-1954).

A preliminary analysis of the data set reveals an imbalanced class distribution, with a significantly lower number of samples in the third class, corresponding to the 1927-1942 interval. This, however, is partially mitigated by the markedly higher average number of word tokens per sample observed in this class compared to the other ones.

3 System Description

The proposed solution is based on a Support Vector Machine (SVM) classifier implemented using the Scikit-learn library (Pedregosa et al., 2011).

To account for the rather imbalanced data set, the SVM is tuned in such a way that classes are assigned weights inversely proportional to their frequency in the input data.

Following the assumption that most text categorisation problems are linearly separable (Joachims, 1998), the model uses a linear kernel implemented in terms of libsvm (Chang and Lin, 2011), while relying on a one-versus-one decision strategy to handle both sub-tasks as multi-class, single-label classification problems.

3.1 Feature space

The system relies solely on the data provided by the task organisers, which is split into a training set (80%) and a development set (20%). No preprocessing is applied, as measures such as case normalisation and punctuation removal do not seem to improve the classification result on the development set, but rather to worsen it.

Each document in the data set is represented using three sets of features: document length in terms of word tokens, as well as character and word n-grams. In this respect, we explore the idea that SVMs trained on combinations of character and word n-grams are particularly effective in tackling text classification tasks (Çöltekin and Rama, 2017; Çöltekin and Rama, 2018).

Character n-grams are extracted for n ∈ {3, 4, 5} and span across word boundaries, thus capturing punctuation and space characters occurring at the beginning and at the end of each word token. Word n-grams, on the other hand, are extracted for n ∈ {1, 2}. Both feature sets are weighted using term frequency-inverse document frequency (TF-IDF) to scale down the impact of the most frequent n-grams.

The number of word tokens per document is computed in a naive way, splitting each sample at every white space. Similarly to the n-gram features, token counts are scaled to a 0-1 range in an attempt to avoid numerical problems and prevent features in higher numeric ranges from dominating those in smaller ones (Hsu et al., 2003).

3.2 Optimisation and Tuning

The system hyper-parameters are optimised to obtain the best F1 score on the development set.

A subset of the hyper-parameters is tuned empirically through several experiments or on the basis of existing literature. This is the case for kernel type, decision strategy, class balancing, tolerance for the stopping criterion (tol) and n-gram size.
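The document representation of section 3.1 and the classifier of section 3 can be sketched with scikit-learn as follows. This is a minimal reconstruction, not the released implementation: the component names are illustrative, and the tuned `max_df`/`min_df` cut-offs from Table 2 are omitted for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.svm import SVC

def token_counts(docs):
    # Naive document length: split each sample at every white space.
    return np.array([[len(doc.split())] for doc in docs])

features = FeatureUnion([
    # Word unigrams and bigrams, TF-IDF weighted, no case normalisation.
    ("word_ngrams", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                                    lowercase=False)),
    # Character 3- to 5-grams; the "char" analyzer spans word boundaries.
    ("char_ngrams", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                                    lowercase=False)),
    # Token count per document, scaled to the 0-1 range.
    ("doc_length", Pipeline([
        ("count", FunctionTransformer(token_counts)),
        ("scale", MinMaxScaler()),
    ])),
])

model = Pipeline([
    ("features", features),
    # Linear kernel, one-versus-one decision strategy, balanced class
    # weights; C and tol as reported in Table 2.
    ("svm", SVC(kernel="linear", decision_function_shape="ovo",
                class_weight="balanced", C=0.881, tol=1e-12)),
])
```

Calling `model.fit(train_docs, train_labels)` then trains the whole stack end to end, with the three feature blocks concatenated by the `FeatureUnion`.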
The remaining hyper-parameters considered during optimisation are the regularisation parameter (C) together with the maximum and minimum document frequency (max_df, min_df), which in the present approach are used to set an acceptance threshold for high- and low-frequency n-grams.

These hyper-parameters are tuned through the BayesSearchCV algorithm implemented in the scikit-optimize library (Head et al., 2020), using a 5-fold shuffled cross-validation. BayesSearchCV relies on Bayesian Optimisation and explores the hyper-parameter search space exploiting the information available from previous evaluations. This is in contrast to other approaches, such as grid and random search, which move across the search space either in an exhaustive or completely random manner.

Table 2 summarises the best hyper-parameters setup obtained from the tuning process.

Component        Parameter          Value
TfidfVectorizer  analyzer           word
                 max_df             0.9
                 min_df             0.004
                 ngram_range        (1, 2)
                 lowercase          False
TfidfVectorizer  analyzer           char
                 max_df             0.3
                 min_df             0.001
                 ngram_range        (3, 5)
                 lowercase          False
SVM              kernel             linear
                 decision_function  ovo
                 tol                1e-12
                 C                  0.881
                 class_weight       balanced

Table 2: Final hyper-parameters setup for each system component.

4 Results

In this section we present the results for the two sub-tasks the system participated in. Results are summarised in Table 3 and reported in terms of macro-average F1 score.

The system ranked first both in the same-genre and in the cross-genre coarse-grained classification task, obtaining a macro-average F1 score of 0.934 and 0.413, respectively.

Sub-task     Team        Run  Macro F1
same-genre   matteo-brv  1    0.934
             matteo-brv  2    0.934
             team 1      1    0.858
             team 1      2    0.855
             baseline    -    0.827
cross-genre  matteo-brv  1    0.413
             matteo-brv  2    0.413
             team 1      1    0.392
             baseline    -    0.368
             team 1      2    0.366

Table 3: Final rankings for sub-task 1 and 2 in terms of macro-average F1 scores.

4.1 Classification on same-genre data

The runs submitted for the first sub-task are based on test samples of the same genre as the ones in the training set. The system scored well above the baseline, which was computed with a Logistic Regression model trained on TF-IDF-weighted word unigrams, without performing any preprocessing.

Overall, the results registered on the test set are in line with those observed during training. This is confirmed by the data summarised in Table 4 and by the confusion matrix in Figure 1.

The confusion matrix depicts a run on the development set which achieved a macro-average F1 score of 0.95, while Table 4 reports the per-class results of the best test run submitted for the sub-task. In both cases 1919-1926, 1943-1947 and 1948-1954 are the classes showing the highest number of misclassifications and, incidentally, are also the ones corresponding to the shortest time periods.

Class      Precision  Recall  F1
1901-1918  0.914      0.986   0.948
1919-1926  0.96       0.872   0.913
1927-1942  0.973      0.973   0.973
1943-1947  0.898      0.898   0.898
1948-1954  0.939      0.933   0.936

Table 4: Per-class results of the best test run for sub-task 1.
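The paper tunes C, max_df and min_df with BayesSearchCV from scikit-optimize. As a self-contained illustration of the same idea using only scikit-learn, an analogous randomized search over a hypothetical version of that space might look like this; the single-vectorizer pipeline, the distributions and their bounds are illustrative, not the tuned setup:

```python
from scipy.stats import loguniform, uniform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Simplified pipeline: one word-level TF-IDF front end plus the linear SVM
# of section 3 (the full system combines several feature blocks).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                              lowercase=False)),
    ("svm", SVC(kernel="linear", decision_function_shape="ovo",
                class_weight="balanced")),
])

# Hypothetical search space mirroring the tuned parameters: the
# regularisation parameter C and the document-frequency cut-offs.
param_distributions = {
    "svm__C": loguniform(1e-3, 1e2),
    "tfidf__max_df": uniform(0.5, 0.5),   # uniform over [0.5, 1.0]
    "tfidf__min_df": uniform(0.0, 0.01),  # uniform over [0.0, 0.01]
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    random_state=0,
)
# After search.fit(train_docs, train_labels), search.best_params_ holds the
# best configuration found under the macro-F1 objective.
```

Swapping `RandomizedSearchCV` for scikit-optimize's `BayesSearchCV` keeps the same estimator/search-space interface while sampling candidates via Bayesian Optimisation instead of at random.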
Figure 1 (confusion matrix, transcribed; rows: true label, columns: predicted label):

             1901-1918  1919-1926  1927-1942  1943-1947  1948-1954
1901-1918    114        0          0          0          0
1919-1926    6          59         0          2          0
1927-1942    1          0          31         0          0
1943-1947    1          0          0          87         9
1948-1954    0          0          0          5          127

Figure 1: Confusion matrix for a development set run with a macro-average F1 score of 0.95.

4.2 Classification on cross-genre data

The runs submitted for the second sub-task are based on samples coming from a cross-genre, out-of-domain test data set. These samples are a subset of the documents collected for the Epistolario project (Tonelli et al., 2020), an ongoing effort to create a digital archive of Alcide De Gasperi's private and public correspondence.

Class      Precision  Recall  F1
1901-1918  0.583      0.7     0.636
1919-1926  1.0        0.15    0.261
1927-1942  0.0        0.0     0.0
1943-1947  0.6        0.75    0.667
1948-1954  0.354      0.85    0.5

Table 5: Per-class results of the best test run for sub-task 2.

As expected, despite scoring above the baseline, cross-genre results are significantly lower than those obtained in the same-genre task. Per-class results summarised in Table 5 show that the promising system performance registered in the same-genre task does not transfer to the cross-genre one, suggesting a poor ability of the model to generalise. Particularly interesting and worth investigating are the results registered for the third class, corresponding to the 1927-1942 interval. With respect to this class, precision and recall values are equal to 0, indicating that the model did not recognise any sample as belonging to this time period.

5 Possible improvements

Results for the same-genre task are quite encouraging and in line with those obtained on the development set, where the F1 score ranges between 0.92 and 0.96. However, with the current data and setup, there might not be much room for further improvement. Nonetheless, additional features like richness measures and linguistically motivated features (e.g. POS tags) are explored in other contributions (Štajner and Zampieri, 2013; Zampieri et al., 2016) and could help achieve more stable results.

On the other hand, results for the second sub-task suggest a lack of generalisation on cross-genre, out-of-domain data. In this respect, even though SVM-based systems for text classification should be able to perform well and take advantage of high-dimensional feature spaces (Joachims, 1998), it might still be worthwhile experimenting with some feature selection methods. Another angle worth considering is that the system might be too sensitive to the shallow n-gram features used to represent the training data. In this case, including deeper text features, such as those encoding syntactic information, might help the system to abstract away from the lexical level. A first step in this direction is attempted by Szymanski and Lynch (2015), who employ Google Syntactic N-grams in an SVM-based system that participated in the Diachronic Text Evaluation shared task (Popescu et al., 2015) at SemEval 2015.

6 Conclusions

In this paper we describe a simple, yet effective, approach to automatic document dating implemented for the DaDoEval shared task at EVALITA 2020. The system is based on a linear Support Vector Machine and is trained on a small set of stylistic and lexical features, resulting in a fast and efficient classification model.

In particular, the approach achieves top scores in both coarse-grained classification sub-tasks, thus confirming that SVM-based systems trained on character and word n-grams are indeed well suited to tackle text classification problems. Nonetheless, results observed in the second task suggest that the model does not generalise well on cross-genre data, leaving room for further improvements.
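As a sanity check on the development-set score discussed in section 4.1, the macro-average F1 can be recomputed directly from the confusion matrix in Figure 1; the matrix values below are transcribed from that figure.

```python
import numpy as np

# Confusion matrix from Figure 1 (rows: true class, columns: predicted class).
cm = np.array([
    [114, 0,  0,  0,  0],
    [6,   59, 0,  2,  0],
    [1,   0,  31, 0,  0],
    [1,   0,  0,  87, 9],
    [0,   0,  0,  5,  127],
])

precision = cm.diagonal() / cm.sum(axis=0)  # per predicted class
recall = cm.diagonal() / cm.sum(axis=1)     # per true class
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()  # ≈ 0.95, matching the reported development score
```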
Acknowledgments

We thank Dr. Çağrı Çöltekin for his patient encouragement and valuable suggestions throughout this project.

References

Omar Alonso, Michael Gertz and Ricardo Baeza-Yates. 2007. On the Value of Temporal Information in Information Retrieval. SIGIR Forum, 41:35–41.

Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates and Michael Gertz. 2011. Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop, 11:1–8.

Valerio Basile, Danilo Croce, Maria Di Maro and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.

Arianna Ciula. 2017. Digital Palaeography: What Is Digital about It? Digital Scholarship in the Humanities, 32(2):ii89–ii105.

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen System in VarDial 2017 Shared Task: Experiments with Language Identification and Cross-Lingual Parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 146–155.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs Perform Better than RNNs in Emoji Prediction. In Proceedings of the 12th International Workshop on Semantic Evaluation, 34–38.

Filip Graliński, Rafał Jaworski, Łukasz Borchmann and Piotr Wierzchoń. 2017. The RetroC Challenge: How to Guess the Publication Year of a Text? In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 29–34.

Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe and Iaroslav Shcherbatyi. 2020. scikit-optimize/scikit-optimize (Version v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.4014775.

Chih-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin. 2003. A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University.

Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), 1398:137–142.

Stefano Menini, Giovanni Moretti, Rachele Sprugnoli and Sara Tonelli. 2020. DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Vlad Niculae, Marcos Zampieri, Liviu Dinu and Alina M. Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2:17–21.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Octavian Popescu and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 870–878.

Sanja Štajner and Marcos Zampieri. 2013. Stylistic Changes for Temporal Text Classification. In Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD), Lecture Notes in Artificial Intelligence, LNAI 8082, Springer, 519–526.

Terrence Szymanski and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 879–883.

Sara Tonelli, Rachele Sprugnoli and Giovanni Moretti. 2019. Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain. In Proceedings of CLiC-it 2019.

Sara Tonelli, Rachele Sprugnoli, Giovanni Moretti, Stefano Malfatti and Marco Odorizzi. 2020. Epistolario De Gasperi: National Edition of De Gasperi's Letters in Digital Format. In Proceedings of AIUCD.

Marcos Zampieri, Shervin Malmasi and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), 4098–4104.