<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>rmassidda @ DaDoEval: Document Dating Using Sentence Embeddings at EVALITA 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riccardo Massidda</string-name>
          <email>r.massidda@studenti.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Università di Pisa</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This report describes an approach to the DaDoEval document dating subtasks of the EVALITA 2020 competition. Dating is tackled as a classification problem, where the significant length of the documents in the provided dataset is addressed by using sentence embeddings in a hierarchical architecture. Three different pre-trained models for generating sentence embeddings have been evaluated and compared: USE, LaBSE and SBERT. In addition to sentence embeddings, the classifier exploits a bag-of-entities representation of the document, generated using a pre-trained named entity recognizer. The final model is able to simultaneously produce the required date for each subtask.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        To solve the DaDoEval task
        <xref ref-type="bibr" rid="ref11">(Menini et al., 2020)</xref>
        for the EVALITA 2020 competition
        <xref ref-type="bibr" rid="ref1">(Basile et al.,
2020)</xref>
        a model should be able to assign a temporal
span, from a discrete set of candidates, to a
document, i.e. to recognize when the document was
issued. Like many other NLP tasks, such as author
identification or topic assignment, this task can be
reduced to a classification problem.
      </p>
      <p>The provided dataset contains documents written by the Italian statesman Alcide De Gasperi in the time span 1901-1954, labeled with the year in which they were issued. The dating task is divided into subtasks of increasing granularity. The first subtask requires classifying a document into one of five representative periods of De Gasperi's life, as identified by historians (Table 1). The second and third subtasks require dating a document more precisely, using a five-year span for the former and the precise year for the latter. These subtasks are referred to as the same-genre subtasks.</p>
      <p>Table 1: The five periods of De Gasperi's life.
ID | Period description | Time span
A | Habsburg years | 1901-1918
B | Beginning of political activity | 1919-1926
C | Internal exile | 1927-1942
D | From fascism to the Italian Republic | 1943-1947
E | Building the Italian Republic | 1948-1954</p>
      <p>In addition to a blind test set held out from the same-genre dataset, the model has also been evaluated on three additional cross-genre subtasks. In this case, documents coming from an archive of De Gasperi's epistolary were used to build an external blind test set. The cross-genre subtasks require classifying documents with the same increasing time granularity as the same-genre ones.</p>
      <p>The tasks are evaluated using macro-averaged F1. Baseline results, obtained using logistic regression with tf-idf on a bag-of-words representation, are provided by the task proponents in table 2.</p>
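      <p>As a concrete reference for the evaluation metric, the following sketch computes macro-averaged F1 with scikit-learn; the library choice and the toy labels are illustrative assumptions, not part of the task setup.</p>
      <preformat>
```python
from sklearn.metrics import f1_score

# Macro-averaged F1: the F1 of each class is computed independently and
# the scores are averaged without weighting by class frequency, so rare
# periods count as much as frequent ones. Labels below are hypothetical.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
score = f1_score(y_true, y_pred, average="macro")
```
      </preformat>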
      <sec id="sec-1-1">
        <title>Experimental setup</title>
        <p>Table 2: Baseline macro-averaged F1: historical 0.827, five-years 0.485, single-year 0.126.</p>
        <p>All of the results and the described experiments
have been implemented using TensorFlow and
executed on the platform Google Colab. The
limitations of the platform regarding continuous
usage are not negligible and had an acknowledgeable
weight in multiple decisions.</p>
        <p>Section 2 describes different approaches to long text classification and presents the various sentence embedding models. Section 3 discusses the peculiarities of the dataset. Section 4 evaluates the different sentence embedding models and compares them with alternative approaches on a single subtask. Section 5 describes the architecture of the final model used to solve all the subtasks; its results are reported in section 6 and discussed in section 7.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Methodological survey</title>
      <p>
        The use of pre-trained transformers such as BERT
        <xref ref-type="bibr" rid="ref6">(Devlin et al., 2019)</xref>
        has remarkably improved the state of the art in many NLP tasks, text classification included. Furthermore, the contextual word embeddings produced by pre-trained transformers are preferable when dealing with polysemy. Documents spanning a wide time period can manifest lexical change, so polysemy may significantly emerge
        <xref ref-type="bibr" rid="ref3">(Blank, 1999)</xref>
        .
      </p>
      <p>When using a transformer model for text classification, the first architectural issue is the length of the documents. To classify a text, a special symbol is usually inserted at the start of the input sequence, and the output corresponding to that symbol is fed into a neural network to retrieve the predicted class. Since the maximum input size of a BERT transformer is 512 tokens, it is unlikely that a whole document will fit. Different architectures are available to overcome this problem.</p>
      <p>
        For certain domains, it has been shown that not all of the text is needed to achieve good classification accuracy. For instance, Sun et al. (2020) propose selecting only part of the text, such as the head, the tail, or both, to reduce the text to the size of the input layer of the transformer. The random selection of tokens inside a document has also proven effective for the topic classification of academic papers
        <xref ref-type="bibr" rid="ref10">(Liu et al., 2018)</xref>
        .
      </p>
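      <p>A minimal sketch of this head-and-tail selection; the 128-token head size is an assumed illustration, since Sun et al. (2020) experiment with several splits.</p>
      <preformat>
```python
def truncate_head_tail(tokens, max_len=512, head=128):
    # Keep the first `head` tokens and fill the rest of the window with
    # the document's tail, so a long document fits a 512-token input.
    if len(tokens) <= max_len:
        return tokens
    return tokens[:head] + tokens[-(max_len - head):]
```
      </preformat>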
      <p>Recently, different solutions have started to exploit hierarchical architectures, segmenting the text so that it can be analyzed in its entirety. Segmenting by sentences may be intuitively perceived as more meaningful than using fixed-length segments. Accordingly, three different sentence embedding solutions have been selected to be implemented and evaluated for the DaDoEval task. All of them provide pre-trained multilingual models, thus satisfying both the computational constraints and the task requirements.</p>
      <p>
        Sentence-BERT, also known as SBERT, produces sentence embeddings by stacking a pooling layer on top of a BERT transformer. A pre-trained BERT model is fine-tuned using Siamese networks, back-propagating over the cosine similarity of pairs of semantically related sentences
        <xref ref-type="bibr" rid="ref12">(Reimers and Gurevych, 2019)</xref>
        . A monolingual model can then be distilled and expanded to other languages by training a student model to replicate the behavior of the teacher model, under the assumption that the vector representations of translated sentences should coincide
        <xref ref-type="bibr" rid="ref13">(Reimers and Gurevych, 2020)</xref>
        . The authors of SBERT published distiluse-base-multilingual-cased, a distilled model pre-trained on many languages, including Italian.
      </p>
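      <p>The distillation objective can be sketched as follows. This is a schematic NumPy reconstruction, not the authors' code: the student is pushed toward the teacher's embedding both for the source sentence and for its translation.</p>
      <preformat>
```python
import numpy as np

def distillation_loss(teacher_src, student_src, student_tgt):
    # Mean-squared-error objective of multilingual distillation: the student
    # embedding of the source sentence and of its translation must both
    # match the teacher embedding of the source sentence.
    return (np.mean((teacher_src - student_src) ** 2)
            + np.mean((teacher_src - student_tgt) ** 2))
```
      </preformat>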
      <p>
        The Universal Sentence Encoder, or USE, comprises different architectures trained on the same set of tasks, to enable transfer learning for many NLP tasks with different requirements
        <xref ref-type="bibr" rid="ref4">(Cer et al., 2018)</xref>
        . The original USE has since been expanded for multilingual applications, providing two pre-trained models, a transformer and a CNN, both available on TensorFlow Hub (Yang et al., 2019).
      </p>
      <p>
        Lastly, the Language-agnostic BERT Sentence Embedding model, or LaBSE, produces sentence embeddings using a fine-tuned BERT model. LaBSE is designed similarly to SBERT, using two weight-sharing transformers initialized from a pre-trained BERT model; the main difference lies in the datasets and tasks used for fine-tuning. The authors report remarkable results for languages unseen during training but related to those in the training set
        <xref ref-type="bibr" rid="ref8">(Feng et al., 2020)</xref>
        . This result may be useful to bridge the gap between contemporary Italian and the XX-century Italian of the dataset.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Data Analysis</title>
      <p>The overall dataset contains 2759 manually labeled documents of variable length, written by Alcide De Gasperi during his political life. However, the development dataset provided by the proponents contains only 2210 of them, since the remaining ones are kept for the blind same-genre test set. The dataset is extremely unbalanced, since the number of elements per time period varies considerably. For instance, figure 1 shows that some years contribute only a few documents to the dataset. The lack of data for these periods remarkably impacts the overall accuracy of the learning process. The development set provided by the proposers has been split into a training set and a validation set to assess the capabilities of the different tested models. The training set was composed by sampling 80% of the development dataset, leaving the remaining 20% to the validation split. This choice reflects the proportion between the size of the provided development set and the overall dataset.</p>
      <p>Without altering the validation split used for assessment, the training data can be augmented to counter the imbalance. The hierarchical solution greatly increases the number of tokens that can be used to classify a document; nonetheless, the number of sentences per document must be capped at a fixed constant. When a document is truncated to limit its number of sentences, the remaining part is inserted into the dataset as a new document instead of being discarded. This data augmentation procedure was implemented under the assumption that the less represented years contain the longest documents. While this holds for some classes, the data augmentation did not significantly alter the overall distribution.</p>
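      <p>The truncation-plus-reinsertion step can be sketched as follows; the cap on sentences per document is a hypothetical parameter.</p>
      <preformat>
```python
def split_and_augment(sentences, max_sents):
    # Cap each document at max_sents sentences; instead of discarding the
    # remainder, re-insert it into the training set as a new document.
    docs = []
    while len(sentences) > max_sents:
        docs.append(sentences[:max_sents])
        sentences = sentences[max_sents:]
    if sentences:
        docs.append(sentences)
    return docs
```
      </preformat>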
      <sec id="sec-3-1">
        <title>Preprocessing</title>
        <p>Table 3: Time required to produce the embeddings over the training set.
Method | Time
SBERT | 223.068 s
LaBSE | 3364.272 s
USETRANS | 154.277 s
USECNN | 29.681 s</p>
        <p>
          The tokenizer for the Italian language included in the NLTK library has been used to split each document into a list of sentences
          <xref ref-type="bibr" rid="ref2">(Bird et al., 2009)</xref>
          . The content of each sentence has instead been tokenized with the dedicated tokenizer of each sentence embedding technique, since each model may require a different configuration and uses its own vocabulary. A common issue in this scenario is the rate of out-of-vocabulary tokens
          <xref ref-type="bibr" rid="ref15">(Wang et al., 2019)</xref>
          , but this has not been evaluated, since the interfaces offered by the selected models give no insight into the OOV rate or other token-level statistics. The time required to produce the embeddings over the training set is reported in table 3.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Building blocks selection</title>
      <p>Because of the computational limitations, many experiments have been conducted on a single subtask, relegating the others to a subsequent phase. The historical subtask has been chosen because of the better balance of its dataset and its foreseeably more promising results. The provided dataset has been split, using stratified sampling and data augmentation, into a larger training set and a smaller validation set. The training split covers 80% of the provided development set, leaving the remaining 20% to the validation one. All of the results are produced by averaging multiple runs, to compensate for the non-deterministic and unpredictable effects of the GPUs used for training.</p>
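      <p>A sketch of the described split, using scikit-learn's stratified sampling; the library choice and the toy data are assumptions made for illustration.</p>
      <preformat>
```python
from sklearn.model_selection import train_test_split

# 80/20 split preserving the per-class proportions (stratification),
# mirroring the described training/validation setup on toy data.
docs = [f"doc-{i}" for i in range(100)]
labels = [i % 5 for i in range(100)]  # five hypothetical periods
train_docs, val_docs, train_y, val_y = train_test_split(
    docs, labels, test_size=0.2, stratify=labels, random_state=0)
```
      </preformat>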
      <sec id="sec-4-1">
        <title>4.1 Truncation based classification</title>
        <p>The first experiments used a pre-trained multilingual BERT model for text classification. To satisfy the constraint on the input size, the documents were truncated to their first 512 tokens. As expected, truncation proved ineffective: even after fine-tuning, the model did not converge on the training set for any subtask. The results are not significant and are therefore not reported.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Sentence embeddings</title>
        <p>Once each document is represented as a sequence of sentence embeddings, two different classification models have been implemented and evaluated. The first is a recurrent neural network with two bidirectional LSTM layers, followed by a combination of dropout and dense layers of decreasing width. The other classifier is based on the transformer architecture: a transformer block, composed of a multi-headed self-attention layer with 128 heads, dropout and layer normalization, is followed by a combination of dropout and dense layers as in the previous solution.</p>
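        <p>A schematic Keras version of the transformer-based classifier; the reduced head count and the layer widths are illustrative assumptions (the described block uses 128 attention heads).</p>
        <preformat>
```python
import tensorflow as tf

def build_transformer_classifier(max_sents, emb_dim, n_classes, num_heads=8):
    # Transformer block over the sequence of sentence embeddings:
    # self-attention + dropout + layer normalization, then average pooling
    # and dense layers of decreasing width, as in the described classifier.
    inp = tf.keras.Input(shape=(max_sents, emb_dim))
    att = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=emb_dim)(inp, inp)
    x = tf.keras.layers.LayerNormalization()(inp + tf.keras.layers.Dropout(0.1)(att))
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(tf.keras.layers.Dropout(0.1)(x))
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)
```
        </preformat>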
        <p>The results of the experiments over the combinations of sentence embeddings and the two classifiers are reported in table 4, showing that the combination of SBERT and the transformer-based classifier is the most adequate. With the exception of LaBSE, all the sentence embedding models gave better results when coupled with a transformer block than with a recurrent neural network. Also, the two variants of USE manifested a more significant gap when coupled with the RNN classifier than with the transformer-based one. Finally, the performance drop of the LaBSE model may reflect a condition also explored by Reimers and Gurevych (2020), where a comparable performance gap with SBERT occurs in semantic textual similarity tasks.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Bag-of-entities</title>
        <p>Another approach to the subtasks consists of exploiting the knowledge of a pre-trained named entity recognizer. It is reasonable to suppose that the entities extracted from a document will produce a good representation of the document itself. In the context of document dating, this could be meaningful under the assumption that the issues discussed by the author vary over the years, consequently influencing the entities contained. By building a vocabulary of unique entities, it is possible to represent each document as a bag-of-entities; a multi-layer dense classifier with dropout can then be trained to predict the correct time span.</p>
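        <p>The bag-of-entities construction can be sketched as follows; the vocabulary and the entity keys are hypothetical.</p>
        <preformat>
```python
import numpy as np

def bag_of_entities(doc_entities, vocab):
    # Represent a document as a count vector over the entity vocabulary;
    # entities outside the vocabulary are ignored.
    vec = np.zeros(len(vocab))
    for ent in doc_entities:
        if ent in vocab:
            vec[vocab[ent]] += 1
    return vec

# Hypothetical vocabulary mapping each unique entity to a position.
vocab = {"De Gasperi": 0, "Trento": 1, "Roma": 2}
doc = bag_of_entities(["Trento", "Roma", "Trento"], vocab)
```
        </preformat>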
        <p>
          Named entity recognition is achieved using one
pre-trained CNN for the Italian language
distributed by spaCy
          <xref ref-type="bibr" rid="ref9">(Honnibal and Montani, 2017)</xref>
          .
Three variants of the same model are provided but, since their differences heavily impact the model size rather than the performance (Table 5), the medium-sized model has been chosen without further validation. Because of this, it is not possible to assess how the performance of the NER alone influences the performance of the overall system.
        </p>
        <p>The NER model returns, for each entity, a pair containing its textual content and a label describing its role. A member of the entity vocabulary can be either the textual content alone or the unique pair of text and label; both methods were implemented and compared, and finally only the label was chosen as representative of the entity.</p>
        <p>[Table 5: F1, precision, recall and size of the three spaCy NER variants.]</p>
        <p>
The transformer classifier using the sentence embeddings provided by SBERT is chosen as the final candidate, since it is the best-performing model on the validation set. As previously discussed, the model selection procedure considered only the first subtask, because of the size and balance of its dataset. To roughly estimate the behavior on all the subtasks, both the sentence embeddings classifier and the bag-of-entities solution have been retrained from scratch on the labels of each subtask and evaluated on the validation set. The results are reported in table 6.</p>
        <p>Table 6: Macro-averaged F1 on the training (TR) and validation (VL) splits for the two approaches.
Task | Baseline | SBERT+Trans TR | SBERT+Trans VL | Bag-of-entities TR | Bag-of-entities VL
Historical | 0.827 | 0.930 | 0.846 | 0.997 | 0.841
Five-years | 0.485 | 0.482 | 0.354 | 0.996 | 0.563
Single-year | 0.126 | 0.086 | 0.040 | 0.990 | 0.211</p>
        <p>It is therefore clear that the two approaches have their advantages on different subtasks. More precisely, the sentence embeddings approach has proven more effective on the historical periods subtask, while the bag-of-entities one obtains better results on the finer-grained subtasks. The problem of combining these two solutions is tackled next.</p>
        <p>The trivial solution would be to hardwire the different approaches into a single model, producing the output for the first subtask with a sentence embeddings model and for the other subtasks with a bag-of-entities one. While this solution would be acceptable, and seemingly above the baseline according to the estimates on the validation set, it is reasonable to assume that the representations for these subtasks could be shared, improving the performance. Different variations of the same architecture are therefore evaluated on the validation set to monitor such an improvement.</p>
        <p>In the final model, the sentence embeddings produced by SBERT are fed to a transformer block containing a multi-headed self-attention layer; its output is then averaged and concatenated with the bag-of-entities representation of the document, before being fed to a multi-layer neural network. The output of each layer of this network is also fed to a dedicated neural network that produces the output of one subtask. The selected order of the subtasks in the multi-layer dense classifier places the historical classification first, followed by the five-year and then the single-year classification. A graphical representation of the architecture is given in figure 2.</p>
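        <p>The hierarchy of subtask heads can be sketched in Keras as below; the layer widths and the per-task class counts (5 periods, 11 five-year spans, 54 years) are assumptions made for illustration.</p>
        <preformat>
```python
import tensorflow as tf

def build_multitask_head(features_dim, classes_per_task=(5, 11, 54)):
    # Each dense layer feeds both the next layer and a dedicated
    # classification network; subtasks are ordered historical,
    # five-year, single-year, as in the selected configuration.
    inp = tf.keras.Input(shape=(features_dim,))
    x, outputs = inp, []
    for n_classes in classes_per_task:
        x = tf.keras.layers.Dense(128, activation="relu")(
            tf.keras.layers.Dropout(0.1)(x))
        outputs.append(tf.keras.layers.Dense(n_classes, activation="softmax")(x))
    return tf.keras.Model(inp, outputs)
```
        </preformat>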
        <p>Both the reverse of the subtask order and the absence of hierarchy, obtained by connecting all the classification networks directly to the transformer block, have been tested. The supposed added value of the concatenation with the entity representation has also been experimentally evaluated. The results of these variations are reported in table 7, where the selected final model for the competition is highlighted. [Table 7: configurations indexed by use of the bag-of-entities (BoE: Y/N) and subtask order (F/B/A).]</p>
        <p>The model has been evaluated using two independent test sets: same-genre and cross-genre. The first is a blind test set, containing documents from the same source as the provided development dataset. The cross-genre set is instead an external test set, containing documents from a different source, specifically an archive of epistolary documents by the same author.</p>
        <p>For each subtask, two runs per test set were submitted; for brevity, table 8 reports only the average result of the submitted runs. The model performs above the baseline in the same-genre evaluation for every subtask, also improving on the performance obtained on the validation set. Concerning the cross-genre evaluation, instead, the model replicates the results of the baseline and shows a significant drop with respect to the validation set.</p>
        <p>Table 8: Macro-averaged F1 on the validation set (VL) and on the two blind test sets (TS), against the baseline (BL).
Task | VL | Same-genre BL | Same-genre TS | Cross-genre BL | Cross-genre TS
Historical | 0.842 | 0.827 | 0.857 | 0.368 | 0.379
Five-years | 0.599 | 0.458 | 0.609 | 0.171 | 0.168
Single-year | 0.236 | 0.126 | 0.265 | 0.020 | 0.055</p>
        <p>The contribution of the bag-of-entities representation was certainly helpful, but it should not overshadow the performance improvement given by the introduction of the hierarchical model. The first three rows of the already discussed table 7 report the results of the model without any contribution from the bag-of-entities representation. While none of these was elected as the best candidate, there is a remarkable improvement over the independent use of the very same building blocks of the final architecture on each subtask.</p>
        <p>The described architecture is open to multiple variations, and only some of them have been formally evaluated and compared. Nonetheless, the selected final model was able to surpass the same-genre baseline on all of the different subtasks. The performance drop on the cross-genre test, however, should be interpreted as a limit to the generalization power of the chosen model. A wider exploration of the model space may increase the overall performance on both the same-genre and the cross-genre tasks.</p>
        <p>Also, targeting multiple subtasks at the same time made the choice of a final model nontrivial; it has therefore been carried out intuitively, considering the results on the validation set for each subtask. A more formal approach to this issue may result in a finer model selection.</p>
        <p>Despite the discussed approximations, sentence embedding models have proven effective also on tasks different from those they were originally conceived for, and compatible with other representations such as the bag-of-entities.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          , Danilo Croce, Maria Di Maro, and
          <string-name>
            <surname>Lucia</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Passaro</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian</article-title>
          .
          <source>In Valerio Basile</source>
          , Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          , Ewan Klein, and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit</article-title>
          . O'Reilly Media, Inc., June.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Blank</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Why do new meanings occur? A cognitive typology of the motivations for lexical semantic change</article-title>
          .
          <source>Historical semantics and cognition</source>
          ,
          <volume>61</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Cer</surname>
          </string-name>
          , Yinfei Yang,
          <string-name>
            <surname>Sheng-yi Kong</surname>
          </string-name>
          , Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar,
          <string-name>
            <surname>Yun-Hsuan</surname>
            <given-names>Sung</given-names>
          </string-name>
          , Brian Strope, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          .
          <year>2018</year>
          . Universal Sentence Encoder. arXiv:1803.11175 [cs], April.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . arXiv:
          <year>1810</year>
          .04805 [cs], May. arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          Explosion.
          <source>ai</source>
          .
          <year>2020</year>
          .
          <article-title>Italian spaCy Models Documentation</article-title>
          . https://spacy.io/models/it.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Fangxiaoyu</given-names>
            <surname>Feng</surname>
          </string-name>
          , Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Language-agnostic BERT Sentence Embedding</article-title>
          . arXiv:
          <year>2007</year>
          .
          <year>01852</year>
          [cs],
          <source>July</source>
          . arXiv:
          <year>2007</year>
          .
          <year>01852</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Honnibal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ines</given-names>
            <surname>Montani</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Kaile Liu, Zhenghai Cong,
          <string-name>
            <given-names>Jiali</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yefei</given-names>
            <surname>Ji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Long Length Document Classification by Local Convolutional Feature Aggregation</article-title>
          . Algorithms,
          <volume>11</volume>
          (
          <issue>8</issue>
          ):
          <fpage>109</fpage>
          , August. Number: 8. Publisher: Multidisciplinary Digital Publishing Institute.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Menini</surname>
          </string-name>
          , Giovanni Moretti, Rachele Sprugnoli, and
          <string-name>
            <given-names>Sara</given-names>
            <surname>Tonelli</surname>
          </string-name>
          .
          <year>2020</year>
          . DaDoEval @ EVALITA 2020:
          <article-title>Same-Genre and Cross-Genre Dating of Historical Documents</article-title>
          . In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors,
          <source>Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA</source>
          <year>2020</year>
          ),
          <article-title>Online</article-title>
          . CEUR.org.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks</article-title>
          . arXiv:
          <year>1908</year>
          .10084 [cs],
          <year>August</year>
          . arXiv:
          <year>1908</year>
          .10084.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation</article-title>
          . arXiv:
          <year>2004</year>
          .09813 [cs],
          <source>April</source>
          . arXiv:
          <year>2004</year>
          .09813.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Chi</given-names>
            <surname>Sun</surname>
          </string-name>
          , Xipeng Qiu, Yige Xu,
          <string-name>
            <given-names>and Xuanjing</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>How to Fine-Tune BERT for Text Classification</article-title>
          ? arXiv:
          <year>1905</year>
          .05583 [cs],
          <source>February</source>
          . arXiv:
          <year>1905</year>
          .05583.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Hai</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          Dian Yu, Kai Sun, Jianshu Chen, and
          <string-name>
            <given-names>Dong</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Improving Pre-Trained Multilingual Models with Vocabulary Expansion</article-title>
          . arXiv:
          <year>1909</year>
          .12440 [cs],
          <source>September</source>
          . arXiv:
          <year>1909</year>
          .12440.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Yinfei</given-names>
            <surname>Yang</surname>
          </string-name>
          , Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar,
          <string-name>
            <surname>Yun-Hsuan</surname>
            <given-names>Sung</given-names>
          </string-name>
          , Brian Strope, and
          <string-name>
            <given-names>Ray</given-names>
            <surname>Kurzweil</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Multilingual Universal Sentence Encoder for Semantic Retrieval</article-title>
          . arXiv:
          <year>1907</year>
          .04307 [cs],
          <source>July</source>
          . arXiv:
          <year>1907</year>
          .04307.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>