=Paper=
{{Paper
|id=Vol-2481/paper12
|storemode=property
|title=Embeddings Shifts as Proxies for Different Word Use in Italian Newspapers
|pdfUrl=https://ceur-ws.org/Vol-2481/paper12.pdf
|volume=Vol-2481
|authors=Michele Cafagna,Lorenzo De Mattei,Malvina Nissim
|dblpUrl=https://dblp.org/rec/conf/clic-it/CafagnaMN19
}}
==Embeddings Shifts as Proxies for Different Word Use in Italian Newspapers==
Michele Cafagna¹,³, Lorenzo De Mattei¹,²,³ and Malvina Nissim³
¹ Department of Computer Science, University of Pisa, Italy
² ItaliaNLP Lab, ILC-CNR, Pisa, Italy
³ University of Groningen, The Netherlands
{m.cafagna,m.nissim}@rug.nl, lorenzo.demattei@di.unipi.it
Abstract

We study how words are used differently in two Italian newspapers at opposite ends of the political spectrum by training embeddings on one newspaper's corpus, updating the weights on the second one, and observing vector shifts. We run two types of analysis, one top-down, based on a pre-selection of frequent words in both newspapers, and one bottom-up, on the basis of a combination of the observed shifts and relative and absolute frequency. The analysis is specific to this data, but the method can serve as a blueprint for similar studies.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Background

Different newspapers, especially if positioned at opposite ends of the political spectrum, can render the same event in different ways. In Example (1), both headlines are about the leader of the Italian political movement "Cinque Stelle" splitting up with his girlfriend, but the Italian left-oriented newspaper la Repubblica (https://www.repubblica.it; rep in the examples) and the right-oriented Il Giornale (http://www.ilgiornale.it; gio in the examples) describe the news quite differently. The news in Example (2), which is about a baby-sitter killing a child in Moscow, is also reported by the two newspapers mentioning and stressing different aspects of the same event.

(1) rep La ex di Di Maio: "E' stato un amore intenso ma non abbiamo retto allo stress della politica"
[en: The ex of Di Maio: "It's been an intense love relationship, but we haven't survived the stress of politics"]
gio Luigino single, è finita la Melodia
[en: Luigino single, the Melody is over]

(2) rep Mosca, "la baby sitter omicida non ha agito da sola"
[en: Moscow, "the killer baby-sitter has not acted alone"]
gio Mosca, la donna killer: "Ho decapitato la bimba perché me l'ha ordinato Allah"
[en: Moscow, the killer woman: "I have beheaded the child because Allah has ordered me to do it"]

Often, though, the same words are used, but with distinct nuances, or in combination with other, different words, as in Examples (3)–(4):

(3) rep Usa: agente uccide un nero disarmato e immobilizzato
[en: Usa: policeman kills an unarmed and immobilised black guy]
gio Oklahoma, poliziotto uccide un nero disarmato: "Ho sbagliato pistola"
[en: Oklahoma: policeman kills an unarmed black guy: "I used the wrong gun"]

(4) rep Corte Sudan annulla condanna, Meriam torna libera
[en: Sudan Court cancels the sentence, Meriam is free again]
gio Sudan, Meriam è libera: non sarà impiccata perché cristiana
[en: Sudan: Meriam is free: she won't be hanged because Christian]

In this work we discuss a method to study how the same words are used differently in two sources, exploiting vector shifts in embedding spaces. The two embedding models built on data coming from la Repubblica and Il Giornale might contain interesting differences, but since they are separate spaces they are not directly comparable. Previous work has encountered this issue from a diachronic perspective: when studying meaning shift in time, embeddings built on data from different periods would encode different usages, but they need to be comparable. Instead of constructing separate spaces and then aligning them
(Hamilton et al., 2016b), we adopt the method used by Kim et al. (2014) and subsequently by Del Tredici et al. (2016) for Italian, whereby embeddings are first trained on a corpus, and then updated with a new one; observing the shifts certain words undergo through the update is a rather successful method to proxy meaning change.

Rather than across time, we update embeddings across sources which are identical in genre (newspapers) but different in political positioning. Specifically, we train embeddings on articles coming from the newspaper la Repubblica (leaning left) and update them using articles coming from the newspaper Il Giornale (leaning right). We take the observed shift of a given word (or the shift in distance between two words) as a proxy for a difference in usage of that term, running two types of analysis. One is top-down, and focuses on a set of specific words which are frequent in both corpora. The other one is bottom-up, focusing on words that appear potentially interesting on the basis of measures that combine the observed shift with both relative and absolute frequency. As a byproduct, we also learn something about the interaction of shifts and frequency.

2 Data

We scraped articles from the online sites of the Italian newspapers la Repubblica and Il Giornale. We concatenated each article to its headline, and obtained a total of 276,120 documents (202,419 for Il Giornale and 73,701 for la Repubblica).

For training the two word embeddings, though, we only used a selection of the data. Since we are interested in studying how the usage of the same words changes across the two newspapers, we wanted to maximise the chance of articles from the two newspapers being on the same topic. Thus, we implemented an automatic alignment, and retained only the aligned news for each of the two corpora. All embeddings are trained on such aligned news.

2.1 Alignment

We align the two datasets using the whole body of the articles. We compute the tf-idf vectors for all the articles of both newspapers and create subsets of relevant news by filtering by date, i.e. considering only news that were published within three days of one another. Once this subset is extracted, we compute cosine similarities between all news in one corpus and in the other corpus using the tf-idf vectors, rank them, and then filter out alignments whose cosine similarity is under a certain threshold. The threshold should be chosen taking into consideration a trade-off between keeping a sufficient number of documents and quality of alignment. In this case, we are relatively happy with a good but not too strict alignment, and after a few tests and manual checks, we found that a threshold of 0.185 works well in practice for these datasets, yielding a good balance between correct alignments and news recall. Table 1 shows the size of the aligned corpus in terms of number of documents and tokens.

Table 1: Size of the aligned corpus.

newspaper       #documents   #tokens
la Repubblica   31,209       23,038,718
Il Giornale     38,984       18,584,121

2.2 Shared lexicon

If we look at the most frequent content words in the datasets (Figure 1), we see that they are indeed very similar, most likely due to the datasets being aligned based on lexical overlap.

Figure 1: Left: top 100 most frequent words in la Repubblica. Right: top 100 in Il Giornale. The words are scaled proportionally to their frequency in the respective datasets.

This selection of frequent words already constitutes a set of interesting tokens to study for their potential usage shift across the two newspapers. In addition, through the updating procedure that we describe in the next section, we will be able to identify which words appear to undergo the heaviest shifts from the original to the updated space, possibly indicating a substantial difference of use across the two newspapers.

2.3 Distinguishability

Seeing that frequent words are shared across the two datasets, we want to ensure that the two datasets are still different enough to make the embeddings update meaningful.

We therefore run a simple classification experiment to assess how distinguishable the two sources are based on lexical features. Using the scikit-learn implementation with default parameters (Pedregosa et al., 2011), we trained a binary linear SVM to predict whether a given document comes from la Repubblica or Il Giornale. We used ten-fold cross-validation over the aligned dataset with only word 1- and 2-grams as features and obtained an overall accuracy of 0.796, and 0.794 and 0.797 average precision and recall, respectively.

This indicates that the two newspapers can be distinguished even when writing about the same topics. Looking at predictive features we can indeed see some words that might be characterising each of the newspapers due to their higher tf-idf weight, thus maintaining distinctive contexts even in similar topics and with frequent shared words.
3 Embeddings and Measures

We train embeddings on one source, and update the weights by training on the other source. Specifically, using the gensim library (Řehůřek and Sojka, 2010), we first train a word2vec model (Mikolov et al., 2013) to learn 128-dimensional vectors on the la Repubblica corpus (using the skip-gram model, a window size of 5, a high-frequency word downsample rate of 1e-4, a learning rate of 0.05 and a minimum word frequency of 3, for 15 iterations). We call these word embeddings spaceR. Next, we update spaceR on the documents of Il Giornale with identical settings, but for 5 iterations rather than 15. The resulting space, spaceRG, has a total vocabulary size of 53,684 words. We decided to go in this direction (rather than train on Il Giornale first and update on la Repubblica later) because the la Repubblica corpus is larger in terms of tokens, thus ensuring a more stable space to start from.

3.1 Quantifying the shift

This procedure makes it possible to observe the shift of any given word, both quantitatively and qualitatively. This is more powerful than building two separate spaces and just checking the nearest neighbours of a selection of words. In the same way that the distance between two words is approximated by the cosine distance of their vectors (Turney and Pantel, 2010), we calculate the distance between a word in spaceR and the same word in spaceRG by taking the norm of the difference between the vectors. This value for word w is referred to as shift_w. The higher shift_w, the larger the difference in usage of w across the two spaces. We observe an average shift of 1.98, with the highest value at 6.65.

Figure 2: Gap-Shift scatter plot of the words in the two newspapers. Darker colour indicates a higher cumulative frequency; a negative gap means higher relative frequency in Il Giornale.

3.2 Frequency impact

By looking at raw shifts and selecting high ones, we could see some potentially interesting words.
Figure 3: Distance matrix between a small set of high frequency words on la Repubblica. The lighter the color, the larger the distance.

Figure 4: Distance matrix between a small set of high frequency words after updating with Il Giornale. The lighter the color, the larger the distance.

However, frequency plays an important role, too (Schnabel et al., 2015). To account for this, we explore the impact of both absolute and relative frequency for each word w. We take the overall frequency of a word by summing the individual occurrences of w in the two corpora (total_w). We also take the difference between the relative frequency of a word in the two corpora, as this might be influencing the shift. We refer to this difference as gap_w, and calculate it as in Equation 1:

    gap_w = log(freq_w^r / |r|) − log(freq_w^g / |g|)        (1)

where freq_w^r and freq_w^g are the frequencies of w in la Repubblica and Il Giornale, and |r| and |g| are the sizes of the two corpora.

A negative gap_w indicates that the word is relatively more frequent in Il Giornale than in la Repubblica, while a positive value indicates the opposite. Words whose relative frequency is similar in both corpora exhibit values around 0.

We observe a tiny but significant negative correlation between total_w and shift_w (-0.093, p < 0.0001), indicating that the more frequent a word, the less it is likely to shift. In Figure 2 we see all the dark dots (most frequent words) concentrated at the bottom of the scatter plot (lower shifts). However, when we consider gap_w and shift_w, we see a more substantial negative correlation (-0.306, p < 0.0001), suggesting that the gap has an influence on the shift: the more negative the gap, the higher the shift. In other words, the shift is larger if a word is relatively more frequent in the corpus used to update the embeddings.

4 Analysis

We use the information that derives from having the original spaceR and the updated spaceRG to carry out two types of analysis. The first one is top-down, with a pre-selection of words to study, while the second one is bottom-up, based on measures combining the shift and frequency.

4.1 Top-down

As a first analysis, we look into the most frequent words in both newspapers and study how their relationships change when we move from spaceR to spaceRG. The words we analyse are the union of those reported in Figure 1. Note that in this analysis we look at pairs of words at once, rather than at the shift of a single word from one space to the next. We build three matrices to visualise the distance between these words.

The first matrix (Figure 3) only considers spaceR, and serves to show how close/distant the words are from one another in la Repubblica. For example, we see that "partito" and "Pd", or "premier" and "Renzi" are close (dark-painted), while "polizia" and "europa" are lighter, thus more distant (probably used in different contexts).

In Figure 4 we show a replica of the first matrix, but now on spaceRG; this matrix lets us see how the distance between pairs of words has changed after updating the weights. Some vectors are farther apart than before, and this is visible in the lighter colour of the figure, like "usa" and "lega" or "italia" and "usa", while some words are closer, like "Berlusconi" and "europa" or "europa" and "politica", which feature darker colour. Specific analysis of the co-occurrences of such words could yield interesting observations on their use in the two newspapers.

In order to better observe the actual difference, the third matrix shows the shift from spaceR to spaceRG, normalised by the logarithm of the absolute difference between total_w1 and total_w2 (Figure 5). (Note that this does not correspond exactly to the gap measure in Eq. 1, since we are considering the difference between two words rather than the difference in occurrence of the same word in the two corpora.) Lighter word-pairs shifted more, thus suggesting different contexts and usage, for example "italia" and "lega". Darker pairs, on the other hand, such as "Pd"-"Partito", are also interesting for deeper analysis, since their joint usage is likely to be quite similar in both newspapers.

Figure 5: Difference matrix between embeddings from spaceR and spaceRG, normalised with the logarithm of the absolute frequency difference in spaceRG. The lighter the colour, the larger the distance between pairs of words.

4.2 Bottom-up

Differently from what we did in the top-down analysis, here we do not look at how the relationship between pairs of pre-selected words changes, but rather at how a single word's usage varies across the two spaces. These words arise from the interaction of gap and shift, which yields various scenarios. Words with a large negative gap (relative frequency higher in Il Giornale) are likely to shift more, but this is probably more of an effect of increased frequency than a genuine shift. Words that have a high gap (occurring relatively less in Il Giornale) are likely to shift less, most likely since adding a few contexts might not cause much shift.

The most interesting cases are words whose relative frequency does not change across the two datasets, but which have a high shift. Zooming in on the words that have small gaps (−0.1 < gap_w < 0.1) provides us with a set of potentially interesting words, especially if they have a shift higher than the average shift. We also require that words obeying the previous constraints occur more often than the average word frequency over the two corpora. Low frequency words are in general less stable (Schnabel et al., 2015), suggesting that shifts for the latter might not be reliable. High frequency words shift globally less (cf. Figure 2), so a higher than average shift could be meaningful.

Figure 6 shows the plot of words that have more or less the same relative frequency in the two newspapers (−0.1 < gap_w < 0.1 and an absolute cumulative frequency higher than average), and we therefore infer that their higher than average shift is mainly due to usage difference. Some comments are provided next to the plot.

These words can be the focus of a dedicated study, and independently of the specific observations that we can make in this context, this method can serve as a way to highlight the hotspot words that deserve attention in a meaning shift study.

4.3 A closer look at nearest neighbours

As a last, more qualitative, analysis, one can inspect how the nearest neighbours of a given word of interest change from one space to the next. In our specific case, we picked a few words (deriving them from the top-down, thus most frequent, and bottom-up selections), and report in Table 2 their top five nearest neighbours in spaceR and in spaceRG. As in most analyses of this kind, one has to rely quite a bit on background and general knowledge to interpret the changes. If we look at "Renzi", for example, a past Prime Minister from the party close to the newspaper la Repubblica, we see that while in spaceR the top neighbours are all members of his own party, and the party itself ("Pd"), in spaceRG politicians from other parties (closer to Il Giornale) get closer to Renzi, such as Berlusconi and Alfano.

5 Conclusions

We experimented with using embeddings shifts as a tool to study how words are used in two different Italian newspapers. We focused on a pre-selection of high frequency words shared by the two newspapers, and on another set of words which were
highlighted as potentially interesting through a newly proposed methodology which combines observed embeddings shifts and relative and absolute frequency. The most differently used words in the two newspapers are proper nouns of politically active individuals as well as places, and concepts that are highly debated on the political scene.

Figure 6: Gap-Shift scatter plot like in Figure 2, zoomed in on the gap region −0.1 to 0.1 and shift greater than 1.978 (average shift). Only words with cumulative frequency higher than average frequency are plotted.

Table 2: A few significant words and their top 5 nearest neighbours in spaceR and spaceRG.

"migranti" [en: migrants]
spaceR: barconi [large boats] (0.60), naufraghi [castaways] (0.57), disperati [wretches] (0.56), barcone [large boat] (0.55), carrette [wrecks] (0.53)
spaceRG: eritrei [Eritreans] (0.61), Lampedusa (0.60), accoglienza [hospitality] (0.59), Pozzallo (0.58), extracomunitari [non-European] (0.57)

"Renzi" [past Prime Minister]
spaceR: Orfini (0.65), Letta (0.64), Cuperlo (0.63), Pd (0.62), Bersani (0.61)
spaceRG: premier (0.60), Nazareno (0.59), Berlusconi (0.58), Cav (0.57), Alfano (0.56)

"politica" [en: politics]
spaceR: leadership (0.65), logica [logic] (0.64), miri [aspire to] (0.63), ambizione [ambition] (0.62), potentati [potentates] (0.61)
spaceRG: tecnocrazia [technocracy] (0.60), democrazia [democracy] (0.59), partitica [of party] (0.58), democratica [democratic] (0.57), legalità [legality] (0.56)

Besides the present showcase, we believe this methodology can more generally be used to highlight which words might deserve deeper, dedicated analysis when studying meaning change.

One aspect that should be further investigated is the role played by the methodology used for aligning and/or updating the embeddings. As an alternative to what we proposed, one could employ different strategies to manipulate embedding spaces towards highlighting meaning changes. For example, Rodda et al. (2016) exploited Representational Similarity Analysis (Kriegeskorte and Kievit, 2013) to compare embeddings built on different spaces in the context of studying diachronic semantic shifts in ancient Greek. Another interesting approach, still in the context of diachronic meaning change but applicable to our datasets, was introduced by Hamilton et al. (2016a), who use both a global and a local neighborhood measure of semantic change to disentangle shifts due to cultural changes from purely linguistic ones.

Acknowledgments

We would like to thank the Center for Information Technology of the University of Groningen for providing access to the Peregrine high performance computing cluster.
References

Marco Del Tredici, Malvina Nissim, and Andrea Zaninello. 2016. Tracing metaphors in time through self-distance in vector spaces. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016).

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016a. Cultural shift or linguistic drift? Comparing two computational measures of semantic change. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, page 2116.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016b. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1489–1501.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pages 61–65, Baltimore, MD, USA. Association for Computational Linguistics.

Nikolaus Kriegeskorte and Rogier A. Kievit. 2013. Representational geometry: integrating cognition, computation, and the brain. Trends in Cognitive Sciences, 17(8):401–412.

Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the Workshop at ICLR 2013.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA. http://is.muni.cz/publication/884893/en.

Martina Astrid Rodda, Marco S. G. Senaldi, and Alessandro Lenci. 2016. Panta rei: Tracking semantic change with distributional semantics in ancient Greek. In CLiC-it/EVALITA.

Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 298–307.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.