=Paper= {{Paper |id=Vol-2765/96 |storemode=property |title=matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating (short paper) |pdfUrl=https://ceur-ws.org/Vol-2765/paper96.pdf |volume=Vol-2765 |authors=Matteo Brivio |dblpUrl=https://dblp.org/rec/conf/evalita/Brivio20 }} ==matteo-brv @ DaDoEval: An SVM-based Approach for Automatic Document Dating (short paper)== https://ceur-ws.org/Vol-2765/paper96.pdf
      matteo-brv @ DaDoEval: An SVM-based Approach for Automatic
                           Document Dating

                                       Matteo Brivio
                                    University of Tübingen
                                   Department of Linguistics
                         matteo.brivio@student.uni-tuebingen.de


                        Abstract

English. This paper describes our contribution to the EVALITA 2020 shared task DaDoEval – Dating Document Evaluation. The solution we present is based on a linear multi-class Support Vector Machine classifier trained on a combination of character and word n-grams, as well as the number of word tokens per document. Despite its simplicity, the system ranked first both in the coarse-grained classification task on same-genre data and in the one on cross-genre data, achieving a macro-average F1 score of 0.934 and 0.413, respectively. The system implementation is available at https://github.com/matteobrv/DaDoEval.

1 Introduction

Temporal information, such as the publication date of a document, is of major relevance in a number of domains, like historical linguistics and digital humanities (Niculae et al., 2014). This is arguably even more true for a wide range of information retrieval tasks, such as document exploration, similarity search, summarisation and clustering, where the temporal dimension plays a major role in improving search results (Alonso et al., 2007; Alonso et al., 2011).

Such information, however, is not always readily available and must therefore be inferred, relying either on qualitative or quantitative methods, if not both (Ciula, 2017). Nonetheless, despite their significance, methods for temporal text classification and automatic document dating are still rather unexplored compared to other text classification tasks (Niculae et al., 2014). This, however, is most likely bound to change, as the increasing availability of large-scale, time-annotated digital resources, such as Google n-grams (http://books.google.com/ngrams), is promoting research in this direction. Two recent examples of this new trend, in line with the present task, are the Diachronic Text Evaluation shared task organised by Popescu et al. (2015) at SemEval 2015 and the RetroC Challenge presented by Graliński et al. (2017).

In this work we propose a simple, yet effective, approach to automatic document dating based on a linear multi-class Support Vector Machine classifier, trained on a combination of character and word n-grams, as well as document length in word tokens.

The solution is evaluated in the context of the DaDoEval – Dating Document Evaluation – shared task at EVALITA 2020 (Menini et al., 2020; Basile et al., 2020). The task is based on the Alcide De Gasperi corpus of public documents (Tonelli et al., 2019) and is organised into six sub-tasks: (I) coarse-grained classification on same-genre data, (II) coarse-grained classification on cross-genre data, (III) fine-grained classification on same-genre data, (IV) fine-grained classification on cross-genre data, (V) year-based classification on same-genre data, and (VI) year-based classification on cross-genre data.

The proposed solution tackles the first two sub-tasks, coarse-grained classification on same-genre and cross-genre data. Both sub-tasks require correctly assigning document samples to one of the five main time periods identified in De Gasperi's political life, spanning a range of over fifty years, from 1901 to 1954.

The paper is structured as follows: in section 2 we provide a brief overview of the training data set, in section 3 we go over the system setup and describe the feature space, and section 4 is dedicated to results analysis and discussion.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                  1901-1918     1919-1926     1927-1942      1943-1947      1948-1954
     S AMPLES PER CLASS               572           342          150             514            632
     AVG . SAMPLE LENGTH              867          1033          3044            633           1209

Table 1: Training set overview, showing the number of document samples per class and the average
number of word tokens per sample, rounded up to the nearest integer.
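The imbalance visible in Table 1 can be quantified with the inverse-frequency class weights that scikit-learn's class_weight="balanced" heuristic would assign, the same weighting scheme the classifier in section 3 relies on. A minimal sketch, using only the counts from Table 1:

```python
# Samples per class, taken from Table 1.
counts = {
    "1901-1918": 572, "1919-1926": 342, "1927-1942": 150,
    "1943-1947": 514, "1948-1954": 632,
}

n_samples = sum(counts.values())  # 2210 documents in total
n_classes = len(counts)           # 5 time periods

# scikit-learn's "balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# i.e. weights inversely proportional to class frequency.
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for period, w in weights.items():
    print(f"{period}: {w:.2f}")
```

The under-represented 1927-1942 class receives a weight close to 3, while the largest class (1948-1954) is down-weighted to roughly 0.7.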


Section 5 considers possible improvements, while section 6 is reserved for final remarks.

2 Data

The training data set released for the shared task includes 2,210 document samples extracted from the Alcide De Gasperi corpus of public documents, a multi-genre collection of 2,759 texts written or transcribed between 1901 and 1954 (Tonelli et al., 2019).

With respect to the coarse-grained classification sub-tasks, the given samples are organised into five classes (see Table 1) corresponding to the main time periods historians have identified in De Gasperi's political life: Habsburg years (1901-1918), Beginning of political activity (1919-1926), Internal exile (1927-1942), From fascism to the Italian Republic (1943-1947), and Building the Italian Republic (1948-1954).

A preliminary analysis of the data set reveals an imbalanced class distribution, with a significantly lower number of samples in the third class, corresponding to the 1927-1942 interval. This, however, is partially mitigated by the markedly higher average number of word tokens per sample observed in this class compared to the other ones.

3 System Description

The proposed solution is based on a Support Vector Machine (SVM) classifier implemented using the Scikit-learn library (Pedregosa et al., 2011).

To account for the rather imbalanced data set, the SVM is tuned in such a way that classes are assigned weights inversely proportional to their frequency in the input data.

Following the assumption that most text categorisation problems are linearly separable (Joachims, 1998), the model uses a linear kernel implemented in terms of libsvm (Chang and Lin, 2011), while relying on a one-versus-one decision strategy to handle both sub-tasks as multi-class, single-label classification problems.

3.1 Feature space

The system relies solely on the data provided by the task organisers, which is split into a training set (80%) and a development set (20%). No preprocessing is applied, as measures such as case normalisation and punctuation removal do not seem to improve the classification results on the development set, but rather to worsen them.

Each document in the data set is represented using three sets of features: document length in terms of word tokens, as well as character and word n-grams. In this respect, we explore the idea that SVMs trained on combinations of character and word n-grams are particularly effective in tackling text classification tasks (Çöltekin and Rama, 2017; Çöltekin and Rama, 2018).

Character n-grams are extracted for n ∈ {3, 4, 5} and span across word boundaries, thus capturing punctuation and space characters occurring at the beginning and at the end of each word token. Word n-grams, on the other hand, are extracted for n ∈ {1, 2}. Both feature sets are weighted using term frequency-inverse document frequency (TF-IDF) to scale down the impact of the most frequent n-grams.

The number of word tokens per document is computed in a naive way, splitting each sample at every white space. Similarly to the n-gram features, token counts are scaled to a 0-1 range in an attempt to avoid numerical problems and prevent features in higher numeric ranges from dominating those in smaller ones (Hsu et al., 2003).

3.2 Optimisation and Tuning

The system hyper-parameters are optimised to obtain the best F1 score on the development set.

A subset of the hyper-parameters is tuned empirically, through several experiments or on the basis of existing literature. This is the case for the kernel type, decision strategy, class balancing, tolerance for the stopping criterion (tol) and n-gram sizes.

The remaining hyper-parameters considered during optimisation are the regularisation parameter (C) together with the maximum and minimum document frequency (max df, min df), which in the present approach are used to set an acceptance threshold for high- and low-frequency n-grams.

These hyper-parameters are tuned through the BayesSearchCV algorithm implemented in the scikit-optimize library (Head et al., 2020), using 5-fold shuffled cross-validation. BayesSearchCV relies on Bayesian optimisation and explores the hyper-parameter search space by exploiting the information available from previous evaluations. This is in contrast to other approaches, such as grid and random search, which move across the search space either in an exhaustive or completely random manner.

Table 2 summarises the best hyper-parameter setup obtained from the tuning process.

    COMPONENT          PARAMETER           VALUE
    TfidfVectorizer    analyzer            word
                       max df              0.9
                       min df              0.004
                       ngram range         (1, 2)
                       lowercase           False
    TfidfVectorizer    analyzer            char
                       max df              0.3
                       min df              0.001
                       ngram range         (3, 5)
                       lowercase           False
    SVM                kernel              linear
                       decision function   ovo
                       tol                 1e-12
                       C                   0.881
                       class weight        balanced

Table 2: Final hyper-parameter setup for each system component.

4 Results

In this section we present the results for the two sub-tasks the system participated in. Results are summarised in Table 3 and reported in terms of macro-average F1 score.

The system ranked first both in the same-genre and in the cross-genre coarse-grained classification task, obtaining a macro-average F1 score of 0.934 and 0.413, respectively.

    SUB-TASK        TEAM            RUN    MACRO F1
    same-genre      matteo-brv      1      0.934
                                    2      0.934
                    team 1          1      0.858
                                    2      0.855
                    baseline        -      0.827
    cross-genre     matteo-brv      1      0.413
                                    2      0.413
                    team 1          1      0.392
                    baseline        -      0.368
                    team 1          2      0.366

Table 3: Final rankings for sub-tasks 1 and 2 in terms of macro-average F1 scores.

4.1 Classification on same-genre data

The runs submitted for the first sub-task are based on test samples of the same genre as the ones in the training set. The system scored well above the baseline, which was computed with a Logistic Regression model trained on TF-IDF-weighted word unigrams, without any preprocessing.

Overall, the results registered on the test set are in line with those observed during training. This is confirmed by the data summarised in Table 4 and by the confusion matrix in Figure 1.

The confusion matrix depicts a run on the development set which achieved a macro-average F1 score of 0.95, while Table 4 reports the per-class results of the best test run submitted for the sub-task. In both cases 1919-1926, 1943-1947 and 1948-1954 are the classes showing the highest number of misclassifications and, incidentally, are also the ones corresponding to the shortest time periods.

    CLASS         PRECISION    RECALL    F1
    1901-1918     0.914        0.986     0.948
    1919-1926     0.96         0.872     0.913
    1927-1942     0.973        0.973     0.973
    1943-1947     0.898        0.898     0.898
    1948-1954     0.939        0.933     0.936

Table 4: Per-class results of the best test run for sub-task 1.
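Putting the pieces together, the setup summarised in Table 2 can be sketched as a single scikit-learn pipeline, with per-class tables like Table 4 produced by classification_report. This is an illustrative sketch, not the exact competition code: the toy documents and labels below are invented, and the min_df/max_df thresholds from Table 2 are left at their defaults because they assume the full 2,210-document training set.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report


def token_counts(docs):
    """Naive document length: split each sample at every white space."""
    return np.array([[len(d.split())] for d in docs])


features = FeatureUnion([
    # TF-IDF-weighted word uni- and bigrams (Table 2, first block).
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2),
                             lowercase=False)),
    # TF-IDF-weighted character 3- to 5-grams, crossing word
    # boundaries (Table 2, second block).
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 5),
                             lowercase=False)),
    # Document length in word tokens, scaled to the 0-1 range.
    ("length", Pipeline([
        ("count", FunctionTransformer(token_counts)),
        ("scale", MinMaxScaler()),
    ])),
])

clf = Pipeline([
    ("features", features),
    # Linear SVM with the hyper-parameters from Table 2.
    ("svm", SVC(kernel="linear", C=0.881, tol=1e-12,
                decision_function_shape="ovo", class_weight="balanced")),
])

# Invented toy samples standing in for the De Gasperi documents.
docs = [
    "Il discorso pronunciato davanti al parlamento austriaco.",
    "La politica estera del nuovo governo italiano.",
    "Lettera sulla ricostruzione della Repubblica Italiana.",
    "Un appello agli elettori per le elezioni del dopoguerra.",
    "Note sulla stampa trentina e la questione nazionale.",
    "Relazione sul congresso del partito popolare.",
]
labels = ["1901-1918", "1919-1926", "1948-1954",
          "1948-1954", "1901-1918", "1919-1926"]

clf.fit(docs, labels)
preds = clf.predict(docs)
print(classification_report(labels, preds, zero_division=0))
```

In a real run the fitted pipeline would of course be evaluated on held-out development or test samples rather than on the training documents themselves.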
Figure 1: Confusion matrix for a development set run with a macro-average F1 score of 0.95. The underlying counts (rows: true label; columns: predicted label) are:

    TRUE \ PRED    1901-1918  1919-1926  1927-1942  1943-1947  1948-1954
    1901-1918         114         0          0          0          0
    1919-1926           6        59          0          2          0
    1927-1942           1         0         31          0          0
    1943-1947           1         0          0         87          9
    1948-1954           0         0          0          5        127

4.2 Classification on cross-genre data

The runs submitted for the second sub-task are based on samples coming from a cross-genre, out-of-domain test data set. These samples are a subset of the documents collected for the Epistolario project (Tonelli et al., 2020), an ongoing effort to create a digital archive of Alcide De Gasperi's private and public correspondence.

    CLASS         PRECISION    RECALL    F1
    1901-1918     0.583        0.7       0.636
    1919-1926     1.0          0.15      0.261
    1927-1942     0.0          0.0       0.0
    1943-1947     0.6          0.75      0.667
    1948-1954     0.354        0.85      0.5

Table 5: Per-class results of the best test run for sub-task 2.

As expected, despite scoring above the baseline, the cross-genre results are significantly lower than those obtained in the same-genre task. The per-class results summarised in Table 5 show how the promising system performance registered in the same-genre task does not transfer to the cross-genre one, suggesting a poor ability of the model to generalise. Particularly interesting and worth investigating are the results registered for the third class, corresponding to the 1927-1942 interval. With respect to this class, precision and recall are both equal to 0, indicating that the model did not recognise any sample as belonging to this time period.

5 Possible improvements

Results for the same-genre task are quite encouraging and in line with those obtained on the development set, where the F1 score ranges between 0.92 and 0.96. However, with the current data and setup, there might not be much room for further improvement. Nonetheless, additional features like richness measures and linguistically motivated features (e.g. POS tags) are explored in other contributions (Štajner and Zampieri, 2013; Zampieri et al., 2016) and could help achieve more stable results.

On the other hand, results for the second sub-task suggest a lack of generalisation on cross-genre, out-of-domain data. In this respect, even though SVM-based systems for text classification should be able to perform well and take advantage of high-dimensional feature spaces (Joachims, 1998), it might still be worthwhile to experiment with some feature selection methods. Another angle worth considering is that the system might be too sensitive to the shallow n-gram features used to represent the training data. In this case, including deeper text features, such as those encoding syntactic information, might help the system to abstract away from the lexical level. A first step in this direction is attempted by Szymanski and Lynch (2015), who employ Google Syntactic N-grams in an SVM-based system that participated in the Diachronic Text Evaluation shared task (Popescu et al., 2015) at SemEval 2015.

6 Conclusions

In this paper we describe a simple, yet effective, approach to automatic document dating implemented for the DaDoEval shared task at EVALITA 2020. The system is based on a linear Support Vector Machine and is trained on a small set of stylistic and lexical features, resulting in a fast and efficient classification model.

In particular, the approach achieves top scores in both coarse-grained classification sub-tasks, thus confirming that SVM-based systems trained on character and word n-grams are indeed well suited to tackle text classification problems.

Nonetheless, the results observed in the second task suggest that the model does not generalise well on cross-genre data, leaving room for further improvements.
Acknowledgments

We thank Dr. Çağrı Çöltekin for his patient encouragement and valuable suggestions throughout this project.

References

Omar Alonso, Michael Gertz and Ricardo Baeza-Yates. 2007. On the value of temporal information in information retrieval. SIGIR Forum, 41:35–41.

Omar Alonso, Jannik Strötgen, Ricardo Baeza-Yates and Michael Gertz. 2011. Temporal Information Retrieval: Challenges and Opportunities. In Proceedings of the 1st International Temporal Web Analytics Workshop, 11:1–8.

Valerio Basile, Danilo Croce, Maria Di Maro and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27.

Arianna Ciula. 2017. Digital palaeography: What is digital about it? Digital Scholarship in the Humanities, 32(2):ii89–ii105.

Çağrı Çöltekin and Taraka Rama. 2017. Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), 146–155.

Çağrı Çöltekin and Taraka Rama. 2018. Tübingen-Oslo at SemEval-2018 Task 2: SVMs perform better than RNNs in emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, 34–38.

Filip Graliński, Rafał Jaworski, Łukasz Borchmann and Piotr Wierzchoń. 2017. The RetroC Challenge: How to Guess the Publication Year of a Text? In Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage, 29–34.

Tim Head, Manoj Kumar, Holger Nahrstaedt, Gilles Louppe and Iaroslav Shcherbatyi. 2020. scikit-optimize/scikit-optimize (Version v0.8.1). Zenodo. http://doi.org/10.5281/zenodo.4014775.

Chih-Wei Hsu, Chih-Chung Chang and Chih-Jen Lin. 2003. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University.

Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning (ECML'98), 1398:137–142.

Stefano Menini, Giovanni Moretti, Rachele Sprugnoli and Sara Tonelli. 2020. DaDoEval @ EVALITA 2020: Same-Genre and Cross-Genre Dating of Historical Documents. In Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020).

Vlad Niculae, Marcos Zampieri, Liviu Dinu and Alina M. Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), 2:17–21.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Octavian Popescu and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 870–878.

Sanja Štajner and Marcos Zampieri. 2013. Stylistic Changes for Temporal Text Classification. In Proceedings of the 16th International Conference on Text, Speech and Dialogue (TSD), Lecture Notes in Artificial Intelligence - LNAI 8082, Springer, 519–526.

Terrence Szymanski and Gerard Lynch. 2015. UCD: Diachronic Text Classification with Character, Word, and Syntactic N-grams. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 879–883.

Sara Tonelli, Rachele Sprugnoli and Giovanni Moretti. 2019. Prendo la Parola in Questo Consesso Mondiale: A Multi-Genre 20th Century Corpus in the Political Domain. In Proceedings of CLiC-it 2019.

Sara Tonelli, Rachele Sprugnoli, Giovanni Moretti, Stefano Malfatti and Marco Odorizzi. 2020. Epistolario De Gasperi: National Edition of De Gasperi's Letters in Digital Format. In Proceedings of AIUCD.

Marcos Zampieri, Shervin Malmasi and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC'16), 4098–4104.